[DISCUSS] JSON Canonical Extension Type

2022-11-17 Thread Pradeep Gollakota
Hi folks!

I put together this specification for canonicalizing the JSON type in Arrow.

## Introduction
JSON is a widely used text-based data interchange format. There are many
use cases where a user has a column whose contents are a JSON-encoded
string. BigQuery's [JSON Type][1] and Parquet's [JSON Logical Type][2] are
two such examples.

The JSON specification is defined in [RFC-8259][3]. However, many of the
most popular parsers support non-standard extensions, such as comments,
unquoted keys, and trailing commas.

## Extension Specification
* The name of the extension is `arrow.json`
* The storage type of the extension is `utf8`
* The extension type has no parameters
* The metadata MUST be either empty or a valid JSON object
  - There is no canonical metadata
  - Implementations MAY include implementation-specific metadata by using
    a namespaced key, for example `{"google.bigquery": {"my": "metadata"}}`
* Implementations...
  - MUST produce valid UTF-8 encoded text
  - SHOULD produce valid standard JSON
  - MAY produce valid non-standard JSON
  - MUST support parsing standard JSON
  - MAY support parsing non-standard JSON
  - SHOULD pass through contents that they do not understand

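To make the wire-level shape concrete, here is a minimal sketch in Java of a
field carrying this extension via Arrow's key-value field metadata (the
`ARROW:extension:*` keys are the standard extension-type mechanism; the column
name and the namespaced metadata value are illustrative):

```
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class JsonExtensionFieldExample {
  public static void main(String[] args) {
    // An extension type travels as key-value metadata on a field whose
    // physical type is the declared storage type (utf8 here).
    Map<String, String> metadata = new HashMap<>();
    metadata.put("ARROW:extension:name", "arrow.json");
    // Optional implementation-specific metadata under a namespaced key.
    metadata.put("ARROW:extension:metadata",
        "{\"google.bigquery\": {\"my\": \"metadata\"}}");

    Field jsonColumn = new Field(
        "payload",  // illustrative column name
        new FieldType(true, ArrowType.Utf8.INSTANCE, null, metadata),
        Collections.emptyList());

    System.out.println(jsonColumn);
  }
}
```
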
## Forward compatibility
In the future we might allow this logical type to annotate a byte storage
type with a different text encoding. Implementations consuming the JSON
extension type should therefore verify the storage type rather than assume `utf8`.

[1]:
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type
[2]:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
[3]: https://datatracker.ietf.org/doc/html/rfc8259


[jira] [Created] (ARROW-17255) Support JSON logical type in Arrow

2022-07-29 Thread Pradeep Gollakota (Jira)
Pradeep Gollakota created ARROW-17255:
-

 Summary: Support JSON logical type in Arrow
 Key: ARROW-17255
 URL: https://issues.apache.org/jira/browse/ARROW-17255
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Archery
Reporter: Pradeep Gollakota


As a BigQuery developer, I would like the Arrow libraries to support the JSON 
logical Type. This would enable us to use the JSON type in the Arrow format of 
our ReadAPI. This would also enable us to use the JSON type to export data from 
BigQuery to Parquet.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Request for PR Review

2018-01-30 Thread Pradeep Gollakota
Gentle thread bump.

On Thu, Jan 18, 2018 at 4:03 PM, Pradeep Gollakota <pradeep...@gmail.com>
wrote:

> Hi All,
>
> Can one of you review my PR at https://github.com/apache/
> parquet-mr/pull/447 please?
>
> Thanks,
> Pradeep
>


Request for PR Review

2018-01-18 Thread Pradeep Gollakota
Hi All,

Can one of you review my PR at https://github.com/apache/parquet-mr/pull/447
please?

Thanks,
Pradeep


Re: Kafka with Zookeeper behind AWS ELB

2017-07-20 Thread Pradeep Gollakota
Luigi,

I strongly urge you to consider a 5 node ZK deployment. I've always done
that in the past for resiliency during maintenance. In a 3 node cluster,
you can only tolerate one "failure", so if you bring one node down for
maintenance and another node crashes during said maintenance, your ZK
cluster is down. All the deployments I've had were 5 nodes of ZK and 5
nodes of Kafka.

- Pradeep

On Thu, Jul 20, 2017 at 9:12 AM, Luigi Tagliamonte <
luigi.tagliamont...@gmail.com> wrote:

> Yes Andrey,
> you can use an ENI without EIP on AWS if you only want a private address.
>
> After some consideration, I think that growing the zookeeper cluster more
> than 3 nodes is really unlikely so I think that I will attach 3 ENI to 3
> servers in autoscaling and I will configure Kafka in using this 3 IPs.
> In this way I can get rid of the additional ELB/Haproxy layer, if I will
> ever need to grow the zk ensemble I will re-engineering the solution.
>
> I'm wondering if reusing an old IP on a brand new zk node will create
> issues in the ensemble.
> Is anybody here aware of possible drawbacks?
>
> On Wed, Jul 19, 2017 at 11:58 PM, Andrey Dyachkov <
> andrey.dyach...@gmail.com
> > wrote:
>
> > The problem with EIP it is a public ip.
> > Another option is to have the secondary interface attached to the
> instance
> > on start(or a bit later) with the private static ip, but we are
> > investigating the possibility.
> > On Wed 19. Jul 2017 at 23:38, Luigi Tagliamonte <
> > luigi.tagliamont...@gmail.com> wrote:
> >
> > > Hello Andrey,
> > > I see that the ELB is not going to help directly with the bug, but
> > > introduces a nice layer that makes zookeeper DNS management easier.
> > > Introducing and ELB I don't have to deal with keep DNS in sync for all
> > the
> > > servers in the zk ensemble.
> > > For the moment I can use an HAproxy with EIP and when the bug is
> solved I
> > > can move to ELB.
> > > What do you think about it?
> > > Regards
> > > L.
> > >
> > > On Wed, Jul 19, 2017 at 2:16 PM, Andrey Dyachkov <
> > > andrey.dyach...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > > I have just posted almost the same question in dev list.
> > > > Zookeeper client resolves address only once, on start, introducing
> ELB
> > > > won't really help here (ELBs can be replaced, which involved ip
> > change),
> > > > but I am eager to know if there is a solution for that.
> > > >
> > > > On Wed, 19 Jul 2017 at 23:08 Luigi Tagliamonte <
> > > > luigi.tagliamont...@gmail.com> wrote:
> > > >
> > > > > Hello, Users!
> > > > > I'm designing a Kafka deployment on AWS and it's my first time
> > working
> > > > with
> > > > > Kafka and Zookeeper so I've collected a lot of info so far but also
> > > some
> > > > > questions that I would love to submit to a much expert audience
> like
> > > you.
> > > > >
> > > > > I have been experimenting with exhibitor and zookeeper in auto
> > scaling
> > > > > group and the exhibitor orchestration seems to work so far.
> > > > >
> > > > > I was trying to find a way to configure zookeeper servers in Kafka
> > conf
> > > > and
> > > > > do not have to reconfigure them in case a zookeeper node needs to
> be
> > > > > replaced/dies, so i taught of course of using DNS but then I read
> > that
> > > > > zkclient library used by Kafka has this bug:
> > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-2184.
> > > > >
> > > > > So I'm now thinking about using an ELB in front of the zookeeper
> > > cluster.
> > > > > Teorically on how zookeeper client should work there should be no
> > > problem
> > > > > but I'm wondering if any of you used that and how is the outcome?
> > > > >
> > > > --
> > > >
> > > > With great enthusiasm,
> > > > Andrey
> > > >
> > >
> > --
> >
> > With great enthusiasm,
> > Andrey
> >
>


Re: Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All,

It appears that the bottleneck in my job was the EBS volumes: I was seeing
very high I/O wait times across the cluster. I was only using one volume;
increasing to four made it faster.

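In case it helps anyone else, a rough sketch of one way to point shuffle spill
at several mounted volumes via spark.local.dir (paths are hypothetical, and on
YARN/EMR the NodeManager's yarn.nodemanager.local-dirs generally overrides this
property, so check which setting your cluster actually honors):

```
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MultiVolumeShuffle {
  public static void main(String[] args) {
    // Spread shuffle and spill files across all attached EBS volumes
    // instead of funneling everything through a single device.
    SparkConf conf = new SparkConf()
        .setAppName("etl-shuffle-example")
        .set("spark.local.dir", "/mnt1/spark,/mnt2/spark,/mnt3/spark,/mnt4/spark");

    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... run the ETL job ...
    sc.stop();
  }
}
```
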
Thanks,
Pradeep

On Thu, Apr 20, 2017 at 3:12 PM, Pradeep Gollakota <pradeep...@gmail.com>
wrote:

> Hi All,
>
> I have a simple ETL job that reads some data, shuffles it and writes it
> back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0.
>
> After Stage 0 completes and the job starts Stage 1, I see a huge slowdown
> in the job. The CPU usage is low on the cluster, as is the network I/O.
> From the Spark Stats, I see large values for the Shuffle Read Blocked Time.
> As an example, one of my tasks completed in 18 minutes, but spent 15
> minutes waiting for remote reads.
>
> I'm not sure why the shuffle is so slow. Are there things I can do to
> increase the performance of the shuffle?
>
> Thanks,
> Pradeep
>


Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All,

I have a simple ETL job that reads some data, shuffles it and writes it
back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0.

After Stage 0 completes and the job starts Stage 1, I see a huge slowdown
in the job. The CPU usage is low on the cluster, as is the network I/O.
From the Spark Stats, I see large values for the Shuffle Read Blocked Time.
As an example, one of my tasks completed in 18 minutes, but spent 15
minutes waiting for remote reads.

I'm not sure why the shuffle is so slow. Are there things I can do to
increase the performance of the shuffle?

Thanks,
Pradeep


Re: Scaling up kafka consumers

2017-02-24 Thread Pradeep Gollakota
Within a consumer group, a single partition is consumed by at most one
consumer; consumers in the group compete to take ownership of partitions.
So, in order to gain parallelism you need to add more partitions.

There is a library that allows multiple consumers to consume from a single
partition https://github.com/gerritjvv/kafka-fast. But I've never used it.

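To make that concrete, a minimal sketch of the usual setup: every node runs a
consumer with the same group.id, and the coordinator gives each partition to
exactly one of them, so running more consumers than partitions leaves the
extras idle (broker address and topic name are placeholders):

```
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CompetingConsumerExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");
    props.put("group.id", "my-service");             // same group.id on every node
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("my-topic"));

    while (true) {
      ConsumerRecords<String, String> records = consumer.poll(1000);
      for (ConsumerRecord<String, String> record : records) {
        // Each record is delivered to exactly one consumer in the group.
        System.out.printf("partition=%d offset=%d value=%s%n",
            record.partition(), record.offset(), record.value());
      }
    }
  }
}
```
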
On Fri, Feb 24, 2017 at 7:30 AM, Jakub Stransky 
wrote:

> Hello everyone,
>
> I was reading/checking kafka documentation regarding point-2-point and
> publish subscribe communications patterns in kafka and I am wondering how
> to scale up consumer side in point to point scenario when consuming from
> single kafka topic.
>
> Let say I have a single topic with single partition and I have one node
> where the kafka consumer is running. If I want to scale up my service I add
> another node - which has the same configuration as the first one (topic,
> partition and consumer group id). Those two nodes start competing for
> messages from kafka topic.
>
> What I am not sure in this scenario and is actually subject of my question
> is "*Whether they do get each node unique messages or there is still
> possibility that some messages will be consumed by both nodes etc*".
> Because I can see scenarios that both nodes are started at the same time -
> they gets the same topic offset from zookeeper and started consuming
> messages from that offset. OR am I thinking in a wrong direction?
>
> Thanks
> Jakub
>


Re: Missing min/max statistics in file footer

2017-02-10 Thread Pradeep Gollakota
metadata.getFileMetadata().createdBy() shows this "parquet-mr version
1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)"

Ignore the 1.9.1-SNAPSHOT... that's my local build as I'm trying to work on
PARQUET-869 <https://issues.apache.org/jira/browse/PARQUET-869>

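If it helps, a quick sketch of feeding that created_by string to parquet-mr's
corrupt-statistics filter (the CorruptStatistics class Lars points at below;
treat the exact method signature as an assumption for your parquet-mr version):

```
import org.apache.parquet.CorruptStatistics;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

public class StatsFilterCheck {
  public static void main(String[] args) {
    String createdBy =
        "parquet-mr version 1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)";
    // True means parquet-mr would discard the footer statistics for this
    // column type because the writer version is on the known-bad list.
    boolean ignored =
        CorruptStatistics.shouldIgnoreStatistics(createdBy, PrimitiveTypeName.BINARY);
    System.out.println("statistics ignored: " + ignored);
  }
}
```
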
On Fri, Feb 10, 2017 at 10:17 AM, Lars Volker <l...@cloudera.com> wrote:

> Can you check the value of ParquetMetaData.created_by? Once you have that,
> you should see if it gets filtered by the code in CorruptStatistics.java.
>
> On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
> > Data was written with Spark but I'm using the parquet APIs directly for
> > reads. I checked the stats in the footer with the following code.
> >
> > ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
> > ParquetMetadataConverter.NO_FILTER);
> > ColumnPath deviceId = ColumnPath.get("deviceId");
> > metadata.getBlocks().forEach(b -> {
> > if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
> > System.out.println("\nBlockSize = " + b.getTotalByteSize());
> > System.out.println("ComprSize = " + b.getCompressedSize());
> > System.out.println("Num Rows  = " + b.getRowCount());
> > b.getColumns().forEach(c -> {
> > if (c.getPath().equals(deviceId)) {
> > Comparable max = c.getStatistics().genericGetMax();
> > Comparable min = c.getStatistics().genericGetMin();
> > System.out.println("\t" + c.getPath() + " [" + min +
> > ", " + max + "]");
> > }
> > });
> > }
> > });
> >
> >
> > Thanks,
> > Pradeep
> >
> > On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <l...@cloudera.com> wrote:
> >
> > > Hi Pradeep,
> > >
> > > I don't have any experience with using Parquet APIs through Spark. That
> > > being said, there are currently several issues around column
> statistics,
> > > both in the format and in the parquet-mr implementation (PARQUET-686,
> > > PARQUET-839, PARQUET-840).
> > >
> > > However, in your case and depending on the versions involved, you might
> > > also hit PARQUET-251, which can cause statistics for some files to be
> > > ignored. In this context it may be worth to have a look at this file:
> > > https://github.com/apache/parquet-mr/blob/master/
> > > parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
> > >
> > > How did you check that the statistics are not written to the footer? If
> > you
> > > used parquet-mr, they may be there but be ignored.
> > >
> > > Cheers, Lars
> > >
> > > On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <
> pradeep...@gmail.com
> > >
> > > wrote:
> > >
> > > > Bumping the thread to see if I get any responses.
> > > >
> > > > On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <
> > pradeep...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > I generated a bunch of parquet files using spark and
> > > > > ParquetThriftOutputFormat. The thirft model has a column called
> > > > "deviceId"
> > > > > which is a string column. It also has a "timestamp" column of
> int64.
> > > > After
> > > > > the files have been generated, I inspected the file footers and
> > noticed
> > > > > that only the "timestamp" field has min/max statistics. My primary
> > > filter
> > > > > will be deviceId, the data is partitioned and sorted by deviceId,
> but
> > > > since
> > > > > the statistics data is missing, it's not able to prune blocks from
> > > being
> > > > > read. Am I missing some configuration setting that allows it to
> > > generate
> > > > > the stats data? The following is code is how an RDD[Thrift] is
> being
> > > > saved
> > > > > to parquet. The configuration is default configuration.
> > > > >
> > > > > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] :
> > > > ClassTag](rdd: RDD[T]) {
> > > > >   def saveAsParquet(output: String,
> > > > > conf: Configuration = rdd.context.
> > > hadoopConfiguration):
> > > > Unit = {
> > > > > val job = Job.getInstance(conf)
> > > > > val clazz: Class[T] = classTag[T].runtimeClass.
> > > > asInstanceOf[Class[T]]
> > > > > ParquetThriftOutputFormat.setThriftClass(job, clazz)
> > > > > val r = rdd.map[(Void, T)](x => (null, x))
> > > > >   .saveAsNewAPIHadoopFile(
> > > > > output,
> > > > > classOf[Void],
> > > > > clazz,
> > > > > classOf[ParquetThriftOutputFormat[T]],
> > > > > job.getConfiguration)
> > > > >   }
> > > > > }
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Pradeep
> > > > >
> > > >
> > >
> >
>


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-10 Thread Pradeep Gollakota
I believe if you're calling the .close() method on shutdown, then the
LeaveGroupRequest will be made. If you're doing a kill -9, I'm not sure if
that request will be made.

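For completeness, a rough sketch of the usual pattern for making the shutdown
graceful so close() -- and therefore the LeaveGroupRequest -- actually runs: a
shutdown hook that calls wakeup() (class, topic, and broker names are made up):

```
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class GracefulShutdownConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");
    props.put("group.id", "my-service");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    final Thread mainThread = Thread.currentThread();

    // SIGTERM (but not kill -9) gives the JVM a chance to run this hook.
    Runtime.getRuntime().addShutdownHook(new Thread() {
      public void run() {
        consumer.wakeup();               // breaks out of poll() with WakeupException
        try {
          mainThread.join();
        } catch (InterruptedException ignored) {
        }
      }
    });

    try {
      consumer.subscribe(Collections.singletonList("my-topic"));
      while (true) {
        consumer.poll(1000);             // process records here
      }
    } catch (WakeupException e) {
      // Expected on shutdown.
    } finally {
      consumer.close();                  // sends the LeaveGroupRequest
    }
  }
}
```
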
On Fri, Feb 10, 2017 at 8:47 AM, Praveen <praveev...@gmail.com> wrote:

> @Pradeep - I just read your thread, the 1hr pause was when all the
> consumers where shutdown simultaneously.  I'm testing out rolling restart
> to get the actual numbers. The initial numbers are promising.
>
> `STOP (1) (1min later kicks off) -> REBALANCE -> START (1) -> REBALANCE
> (takes 1min to get a partition)`
>
> In your thread, Ewen says -
>
> "The LeaveGroupRequest is only sent on a graceful shutdown. If a
> consumer knows it is going to
> shutdown, it is good to proactively make sure the group knows it needs to
> rebalance work because some of the partitions that were handled by the
> consumer need to be handled by some other group members."
>
> So does this mean that the consumer should inform the group ahead of
> time before it goes down? Currently, I just shutdown the process.
>
>
> On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
> > I asked a similar question a while ago. There doesn't appear to be a way
> to
> > not triggering the rebalance. But I'm not sure why it would be taking >
> 1hr
> > in your case. For us it was pretty fast.
> >
> > https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
> >
> >
> >
> > On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
> > krzysztof.lesniew...@nexiot.ch> wrote:
> >
> > > Would be great to get some input on it.
> > >
> > > - Krzysztof Lesniewski
> > >
> > >
> > > On 06.02.2017 08:27, Praveen wrote:
> > >
> > >> I have a 16 broker kafka cluster. There is a topic with 32 partitions
> > >> containing real time data and on the other side, I have 32 boxes w/ 1
> > >> consumer reading from these partitions.
> > >>
> > >> Today our deployment strategy is stop, deploy and start the processes
> on
> > >> all the 32 consumers. This triggers re-balancing and takes a long
> period
> > >> of
> > >> time (> 1hr). Such a long pause isn't good for real time processing.
> > >>
> > >> I was thinking of rolling deploy but I think that will still cause
> > >> re-balancing b/c we will still have consumers go down and come up.
> > >>
> > >> How do you deploy to consumers without triggering re-balancing (or
> > >> triggering one that doesn't affect your SLA) when doing real time
> > >> processing?
> > >>
> > >> Thanks,
> > >> Praveen
> > >>
> > >>
> > >
> >
>


Re: Missing min/max statistics in file footer

2017-02-10 Thread Pradeep Gollakota
Data was written with Spark but I'm using the parquet APIs directly for
reads. I checked the stats in the footer with the following code.

ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
    ParquetMetadataConverter.NO_FILTER);
ColumnPath deviceId = ColumnPath.get("deviceId");
metadata.getBlocks().forEach(b -> {
    if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
        System.out.println("\nBlockSize = " + b.getTotalByteSize());
        System.out.println("ComprSize = " + b.getCompressedSize());
        System.out.println("Num Rows  = " + b.getRowCount());
        b.getColumns().forEach(c -> {
            if (c.getPath().equals(deviceId)) {
                Comparable max = c.getStatistics().genericGetMax();
                Comparable min = c.getStatistics().genericGetMin();
                System.out.println("\t" + c.getPath() + " [" + min + ", " + max + "]");
            }
        });
    }
});


Thanks,
Pradeep

On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <l...@cloudera.com> wrote:

> Hi Pradeep,
>
> I don't have any experience with using Parquet APIs through Spark. That
> being said, there are currently several issues around column statistics,
> both in the format and in the parquet-mr implementation (PARQUET-686,
> PARQUET-839, PARQUET-840).
>
> However, in your case and depending on the versions involved, you might
> also hit PARQUET-251, which can cause statistics for some files to be
> ignored. In this context it may be worth to have a look at this file:
> https://github.com/apache/parquet-mr/blob/master/
> parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
>
> How did you check that the statistics are not written to the footer? If you
> used parquet-mr, they may be there but be ignored.
>
> Cheers, Lars
>
> On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
> > Bumping the thread to see if I get any responses.
> >
> > On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > wrote:
> >
> > > Hi folks,
> > >
> > > I generated a bunch of parquet files using spark and
> > > ParquetThriftOutputFormat. The thirft model has a column called
> > "deviceId"
> > > which is a string column. It also has a "timestamp" column of int64.
> > After
> > > the files have been generated, I inspected the file footers and noticed
> > > that only the "timestamp" field has min/max statistics. My primary
> filter
> > > will be deviceId, the data is partitioned and sorted by deviceId, but
> > since
> > > the statistics data is missing, it's not able to prune blocks from
> being
> > > read. Am I missing some configuration setting that allows it to
> generate
> > > the stats data? The following is code is how an RDD[Thrift] is being
> > saved
> > > to parquet. The configuration is default configuration.
> > >
> > > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] :
> > ClassTag](rdd: RDD[T]) {
> > >   def saveAsParquet(output: String,
> > > conf: Configuration = rdd.context.
> hadoopConfiguration):
> > Unit = {
> > > val job = Job.getInstance(conf)
> > > val clazz: Class[T] = classTag[T].runtimeClass.
> > asInstanceOf[Class[T]]
> > > ParquetThriftOutputFormat.setThriftClass(job, clazz)
> > > val r = rdd.map[(Void, T)](x => (null, x))
> > >   .saveAsNewAPIHadoopFile(
> > > output,
> > > classOf[Void],
> > > clazz,
> > > classOf[ParquetThriftOutputFormat[T]],
> > > job.getConfiguration)
> > >   }
> > > }
> > >
> > >
> > > Thanks,
> > > Pradeep
> > >
> >
>


Re: How does one deploy to consumers without causing re-balancing for real time use case?

2017-02-10 Thread Pradeep Gollakota
I asked a similar question a while ago. There doesn't appear to be a way to
not triggering the rebalance. But I'm not sure why it would be taking > 1hr
in your case. For us it was pretty fast.

https://www.mail-archive.com/users@kafka.apache.org/msg23925.html



On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
krzysztof.lesniew...@nexiot.ch> wrote:

> Would be great to get some input on it.
>
> - Krzysztof Lesniewski
>
>
> On 06.02.2017 08:27, Praveen wrote:
>
>> I have a 16 broker kafka cluster. There is a topic with 32 partitions
>> containing real time data and on the other side, I have 32 boxes w/ 1
>> consumer reading from these partitions.
>>
>> Today our deployment strategy is stop, deploy and start the processes on
>> all the 32 consumers. This triggers re-balancing and takes a long period
>> of
>> time (> 1hr). Such a long pause isn't good for real time processing.
>>
>> I was thinking of rolling deploy but I think that will still cause
>> re-balancing b/c we will still have consumers go down and come up.
>>
>> How do you deploy to consumers without triggering re-balancing (or
>> triggering one that doesn't affect your SLA) when doing real time
>> processing?
>>
>> Thanks,
>> Praveen
>>
>>
>


Re: Missing min/max statistics in file footer

2017-02-10 Thread Pradeep Gollakota
Bumping the thread to see if I get any responses.

On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pradeep...@gmail.com>
wrote:

> Hi folks,
>
> I generated a bunch of parquet files using spark and
> ParquetThriftOutputFormat. The thirft model has a column called "deviceId"
> which is a string column. It also has a "timestamp" column of int64. After
> the files have been generated, I inspected the file footers and noticed
> that only the "timestamp" field has min/max statistics. My primary filter
> will be deviceId, the data is partitioned and sorted by deviceId, but since
> the statistics data is missing, it's not able to prune blocks from being
> read. Am I missing some configuration setting that allows it to generate
> the stats data? The following is code is how an RDD[Thrift] is being saved
> to parquet. The configuration is default configuration.
>
> implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: 
> RDD[T]) {
>   def saveAsParquet(output: String,
> conf: Configuration = rdd.context.hadoopConfiguration): 
> Unit = {
> val job = Job.getInstance(conf)
> val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
> ParquetThriftOutputFormat.setThriftClass(job, clazz)
> val r = rdd.map[(Void, T)](x => (null, x))
>   .saveAsNewAPIHadoopFile(
> output,
> classOf[Void],
> clazz,
> classOf[ParquetThriftOutputFormat[T]],
> job.getConfiguration)
>   }
> }
>
>
> Thanks,
> Pradeep
>


Missing min/max statistics in file footer

2017-02-08 Thread Pradeep Gollakota
Hi folks,

I generated a bunch of parquet files using Spark and
ParquetThriftOutputFormat. The thrift model has a column called "deviceId"
which is a string column. It also has a "timestamp" column of int64. After
the files have been generated, I inspected the file footers and noticed
that only the "timestamp" field has min/max statistics. My primary filter
will be deviceId, the data is partitioned and sorted by deviceId, but since
the statistics data is missing, it's not able to prune blocks from being
read. Am I missing some configuration setting that allows it to generate
the stats data? The following code shows how an RDD[Thrift] is being saved
to parquet. The configuration is the default configuration.

implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
  def saveAsParquet(output: String,
                    conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
    val job = Job.getInstance(conf)
    val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
    ParquetThriftOutputFormat.setThriftClass(job, clazz)
    val r = rdd.map[(Void, T)](x => (null, x))
      .saveAsNewAPIHadoopFile(
        output,
        classOf[Void],
        clazz,
        classOf[ParquetThriftOutputFormat[T]],
        job.getConfiguration)
  }
}


Thanks,
Pradeep


Re: Unable to compile thrift

2017-02-08 Thread Pradeep Gollakota
After modifying the failing line to appropriately use reflection, the
following command successfully built parquet-mr with thrift 0.9.3

LC_ALL=C mvn install -DskipTests -Dthrift.version=0.9.3
-Dsun.zip.disableMemoryMapping=true

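For reference, a rough sketch of the general shape of that reflection guard --
not the actual patch, just the idea (helper and variable names are made up):

```
import java.lang.reflect.Method;

import org.apache.thrift.protocol.TBinaryProtocol;

public class SetReadLengthCompat {
  // Calls TBinaryProtocol.setReadLength(int) only if the linked Thrift
  // version still has it; newer Thrift releases dropped the method.
  static void maybeSetReadLength(TBinaryProtocol protocol, int length) {
    try {
      Method m = TBinaryProtocol.class.getMethod("setReadLength", int.class);
      m.invoke(protocol, length);
    } catch (NoSuchMethodException e) {
      // Newer Thrift: nothing to do.
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }
}
```
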
I'm not sure if I actually need the sun.zip.disableMemoryMapping property set
to true or not. The first build failed with a SIGSEGV, and looking into it, I
found this JDK bug: https://bugs.openjdk.java.net/browse/JDK-8168632. For
the second build I set the property and it succeeded.

That should unblock me. Thanks for your help, guys!

On Wed, Feb 8, 2017 at 12:23 PM, Lars Volker <l...@cloudera.com> wrote:

> I remember trying to compile with the latest version of thrift shipped in
> Ubuntu 14.04 a few weeks back and got the same error. Using 0.7 worked
> though. Sadly I don't know why it fails on a Mac.
>
> On Feb 8, 2017 21:18, "Pradeep Gollakota" <pradeep...@gmail.com> wrote:
>
> > I tried building with -Dthrift.version=0.9.3 and the build still failed
> > with the following error:
> >
> > [ERROR] Failed to execute goal
> > org.apache.maven.plugins:maven-compiler-plugin:3.1:compile
> > (default-compile) on project parquet-thrift: Compilation failure
> > [ERROR] /Users/pgollakota/workspace/apache/parquet-mr/parquet-
> > thrift/src/main/java/org/apache/parquet/hadoop/thrift/
> > ThriftBytesWriteSupport.java:[144,34]
> > cannot find symbol
> > [ERROR] symbol:   method setReadLength(int)
> > [ERROR] location: class org.apache.thrift.protocol.TBinaryProtocol
> > [ERROR] -> [Help 1]
> > [ERROR]
> >
> > So it looks like the source is not compatible?
> > ​
> >
> > On Wed, Feb 8, 2017 at 12:14 PM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> > > hi Pradeep -- you can use Thrift 0.7 or higher (the instructions say
> > > "0.7+", perhaps we should call this out more explicitly). I recommend
> > > building Thrift 0.9.3 or 0.10 -- let us know if you have issues with
> > > these
> > >
> > > Thanks
> > > Wes
> > >
> > > On Wed, Feb 8, 2017 at 2:19 PM, Pradeep Gollakota <
> pradeep...@gmail.com>
> > > wrote:
> > > > Hi folks,
> > > >
> > > > I'm trying to build parquet from source. However, the instructions
> call
> > > for
> > > > the installation of thrift-0.7.0. I'm developing on a mac and that
> > > version
> > > > of thrift does not compile. I get the following error when I run
> make:
> > > >
> > > > ```
> > > > make
> > > > /Library/Developer/CommandLineTools/usr/bin/make  all-recursive
> > > > Making all in compiler/cpp
> > > > /Library/Developer/CommandLineTools/usr/bin/make  all-am
> > > > make[3]: Nothing to be done for `all-am'.
> > > > Making all in lib
> > > > Making all in cpp
> > > > Making all in .
> > > > if /bin/sh ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H
> > -I.
> > > > -I. -I../..  -I/usr/local/include -I./src  -Wall -g -O2 -MT
> > > > ThreadManager.lo -MD -MP -MF ".deps/ThreadManager.Tpo" -c -o
> > > > ThreadManager.lo `test -f 'src/concurrency/ThreadManager.cpp' ||
> echo
> > > > './'`src/concurrency/ThreadManager.cpp; \
> > > > then mv -f ".deps/ThreadManager.Tpo" ".deps/ThreadManager.Plo"; else
> rm
> > > -f
> > > > ".deps/ThreadManager.Tpo"; exit 1; fi
> > > >  g++ -DHAVE_CONFIG_H -I. -I. -I../.. -I/usr/local/include -I./src
> -Wall
> > > -g
> > > > -O2 -MT ThreadManager.lo -MD -MP -MF .deps/ThreadManager.Tpo -c
> > > > src/concurrency/ThreadManager.cpp  -fno-common -DPIC -o
> > > > .libs/ThreadManager.o
> > > > In file included from src/concurrency/ThreadManager.cpp:20:
> > > > src/concurrency/ThreadManager.h:24:10: fatal error: 'tr1/functional'
> > > file
> > > > not found
> > > > #include <tr1/functional>
> > > >  ^
> > > > 1 error generated.
> > > > make[4]: *** [ThreadManager.lo] Error 1
> > > > make[3]: *** [all-recursive] Error 1
> > > > make[2]: *** [all-recursive] Error 1
> > > > make[1]: *** [all-recursive] Error 1
> > > > make: *** [all] Error 2
> > > > ```
> > > >
> > > > I'm going through thrift jira's to see if I can find any workarounds,
> > > but I
> > > > was wondering if any of you have faced this issue and have gotten
> > around
> > > it.
> > > >
> > > > Thanks,
> > > > Pradeep
> > >
> >
>


Re: Unable to compile thrift

2017-02-08 Thread Pradeep Gollakota
I tried building with -Dthrift.version=0.9.3 and the build still failed
with the following error:

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile
(default-compile) on project parquet-thrift: Compilation failure
[ERROR] 
/Users/pgollakota/workspace/apache/parquet-mr/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftBytesWriteSupport.java:[144,34]
cannot find symbol
[ERROR] symbol:   method setReadLength(int)
[ERROR] location: class org.apache.thrift.protocol.TBinaryProtocol
[ERROR] -> [Help 1]
[ERROR]

So it looks like the source is not compatible?

On Wed, Feb 8, 2017 at 12:14 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Pradeep -- you can use Thrift 0.7 or higher (the instructions say
> "0.7+", perhaps we should call this out more explicitly). I recommend
> building Thrift 0.9.3 or 0.10 -- let us know if you have issues with
> these
>
> Thanks
> Wes
>
> On Wed, Feb 8, 2017 at 2:19 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
> > Hi folks,
> >
> > I'm trying to build parquet from source. However, the instructions call
> for
> > the installation of thrift-0.7.0. I'm developing on a mac and that
> version
> > of thrift does not compile. I get the following error when I run make:
> >
> > ```
> > make
> > /Library/Developer/CommandLineTools/usr/bin/make  all-recursive
> > Making all in compiler/cpp
> > /Library/Developer/CommandLineTools/usr/bin/make  all-am
> > make[3]: Nothing to be done for `all-am'.
> > Making all in lib
> > Making all in cpp
> > Making all in .
> > if /bin/sh ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I.
> > -I. -I../..  -I/usr/local/include -I./src  -Wall -g -O2 -MT
> > ThreadManager.lo -MD -MP -MF ".deps/ThreadManager.Tpo" -c -o
> > ThreadManager.lo `test -f 'src/concurrency/ThreadManager.cpp' || echo
> > './'`src/concurrency/ThreadManager.cpp; \
> > then mv -f ".deps/ThreadManager.Tpo" ".deps/ThreadManager.Plo"; else rm
> -f
> > ".deps/ThreadManager.Tpo"; exit 1; fi
> >  g++ -DHAVE_CONFIG_H -I. -I. -I../.. -I/usr/local/include -I./src -Wall
> -g
> > -O2 -MT ThreadManager.lo -MD -MP -MF .deps/ThreadManager.Tpo -c
> > src/concurrency/ThreadManager.cpp  -fno-common -DPIC -o
> > .libs/ThreadManager.o
> > In file included from src/concurrency/ThreadManager.cpp:20:
> > src/concurrency/ThreadManager.h:24:10: fatal error: 'tr1/functional'
> file
> > not found
> > #include <tr1/functional>
> >  ^
> > 1 error generated.
> > make[4]: *** [ThreadManager.lo] Error 1
> > make[3]: *** [all-recursive] Error 1
> > make[2]: *** [all-recursive] Error 1
> > make[1]: *** [all-recursive] Error 1
> > make: *** [all] Error 2
> > ```
> >
> > I'm going through thrift jira's to see if I can find any workarounds,
> but I
> > was wondering if any of you have faced this issue and have gotten around
> it.
> >
> > Thanks,
> > Pradeep
>


[jira] [Created] (PARQUET-869) Min/Max record counts for block size checks are not configurable

2017-02-07 Thread Pradeep Gollakota (JIRA)
Pradeep Gollakota created PARQUET-869:
-

 Summary: Min/Max record counts for block size checks are not 
configurable
 Key: PARQUET-869
 URL: https://issues.apache.org/jira/browse/PARQUET-869
 Project: Parquet
  Issue Type: Improvement
Reporter: Pradeep Gollakota


While the min/max record counts for page size check are configurable via 
ParquetOutputFormat.MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK and 
ParquetOutputFormat.MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK configs and via 
ParquetProperties directly, the min/max record counts for block size check are 
hard-coded inside InternalParquetRecordWriter.

These two settings should also be configurable.

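For illustration, a short sketch of setting the page-size-check counterparts named above (assuming they are public String config keys on ParquetOutputFormat; values are arbitrary):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetOutputFormat;

public class PageSizeCheckConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Tune how often the writer checks whether a page has reached its target size.
    conf.setInt(ParquetOutputFormat.MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK, 100);
    conf.setInt(ParquetOutputFormat.MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK, 10000);
    // No equivalent exists yet for the block-size check inside
    // InternalParquetRecordWriter, which is the point of this ticket.
  }
}
```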


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Pradeep Gollakota
Usually this kind of thing can be done at a lower level in the InputFormat,
typically by specifying the max split size. Have you looked into that
possibility with your InputFormat?

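To make that concrete, a small sketch of capping the split size for a new-API
FileInputFormat (the 64 MB figure and the path are just examples; whether a
custom InputFormat honors this key depends on the format):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SmallerSplitsExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("smaller-splits"));

    // Ask new-API FileInputFormats to emit splits of at most ~64 MB, which
    // yields more (and still node-local) partitions without a shuffle.
    Configuration conf = new Configuration(sc.hadoopConfiguration());
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);

    JavaPairRDD<LongWritable, Text> lines = sc.newAPIHadoopFile(
        "hdfs:///data/input",        // illustrative path
        TextInputFormat.class, LongWritable.class, Text.class, conf);
    System.out.println("partitions: " + lines.partitions().size());
    sc.stop();
  }
}
```
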
On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu  wrote:

> Hi Jasbir,
>
> Yes, you are right. Do you have any idea about my question?
>
> Thanks,
> Fei
>
> On Mon, Jan 16, 2017 at 12:37 AM,  wrote:
>
>> Hi,
>>
>>
>>
>> Coalesce is used to decrease the number of partitions. If you give the
>> value of numPartitions greater than the current partition, I don’t think
>> RDD number of partitions will be increased.
>>
>>
>>
>> Thanks,
>>
>> Jasbir
>>
>>
>>
>> *From:* Fei Hu [mailto:hufe...@gmail.com]
>> *Sent:* Sunday, January 15, 2017 10:10 PM
>> *To:* zouz...@cs.toronto.edu
>> *Cc:* user @spark ; dev@spark.apache.org
>> *Subject:* Re: Equally split a RDD partition into two partition at the
>> same node
>>
>>
>>
>> Hi Anastasios,
>>
>>
>>
>> Thanks for your reply. If I just increase the numPartitions to be twice
>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
>> the data locality? Do I need to define my own Partitioner?
>>
>>
>>
>> Thanks,
>>
>> Fei
>>
>>
>>
>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias 
>> wrote:
>>
>> Hi Fei,
>>
>>
>>
>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>
>>
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/src/
>> main/scala/org/apache/spark/rdd/RDD.scala#L395
>> 
>>
>>
>>
>> coalesce is mostly used for reducing the number of partitions before
>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>> requirements) if you increase the # of partitions.
>>
>>
>>
>> Best,
>>
>> Anastasios
>>
>>
>>
>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu  wrote:
>>
>> Dear all,
>>
>>
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>>
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>>
>>
>> Thanks in advance,
>>
>> Fei
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> -- Anastasios Zouzias
>>
>>
>>
>> --
>>
>> This message is for the designated recipient only and may contain
>> privileged, proprietary, or otherwise confidential information. If you have
>> received it in error, please notify the sender immediately and delete the
>> original. Any other use of the e-mail by you is prohibited. Where allowed
>> by local law, electronic communications with Accenture and its affiliates,
>> including e-mail and instant messaging (including content), may be scanned
>> by our systems for the purposes of information security and assessment of
>> internal compliance with Accenture policy.
>> 
>> __
>>
>> www.accenture.com
>>
>
>


Re: Consumer Rebalancing Question

2017-01-06 Thread Pradeep Gollakota
What I mean by "flapping" in this context is unnecessary rebalancing
happening. The example I would give is what a Hadoop Datanode would do in
case of a shutdown. By default, it will wait 10 minutes before replicating
the blocks owned by the Datanode so routine maintenance wouldn't cause
unnecessary shuffling of blocks.

In this context, if I'm performing a rolling restart, as soon as worker 1
shuts down, its work is picked up by other workers. But worker 1 comes
back 3 seconds (or whatever) later and requests the work back. Then worker
2 goes down and its work is assigned to other workers for 3 seconds before
yet another rebalance. So, in theory, the order of operations will look
something like this:

STOP (1) -> REBALANCE -> START (1) -> REBALANCE -> STOP (2) -> REBALANCE ->
START (2) -> REBALANCE -> ...

From what I understand, there's currently no way to prevent this type of
shuffling of partitions from worker to worker while the consumers are under
maintenance. I'm also not sure whether this is an issue I need to worry about
at all.

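For reference, a sketch of the consumer settings that bound how long the
reassignment takes when a node goes away without a LeaveGroupRequest -- the
coordinator waits out session.timeout.ms before handing the partitions to
someone else (values are illustrative, not recommendations):

```
import java.util.Properties;

public class RebalanceTimingProps {
  public static Properties consumerProps() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");
    props.put("group.id", "my-service");
    // How long the coordinator waits for heartbeats before declaring a
    // consumer dead and rebalancing (the worst case for a kill -9).
    props.put("session.timeout.ms", "10000");
    // How often the consumer heartbeats; must be well below the session timeout.
    props.put("heartbeat.interval.ms", "3000");
    // Upper bound on time between poll() calls before the consumer is
    // considered stuck and its partitions are reassigned (0.10.1+).
    props.put("max.poll.interval.ms", "300000");
    return props;
  }
}
```
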
- Pradeep

On Thu, Jan 5, 2017 at 8:29 PM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> Not sure I understand your question about flapping. The LeaveGroupRequest
> is only sent on a graceful shutdown. If a consumer knows it is going to
> shutdown, it is good to proactively make sure the group knows it needs to
> rebalance work because some of the partitions that were handled by the
> consumer need to be handled by some other group members.
>
> There's no "flapping" in the sense that the leave group requests should
> just inform the other members that they need to take over some of the work.
> I would normally think of "flapping" as meaning that things start/stop
> unnecessarily. In this case, *someone* needs to deal with the rebalance and
> pick up the work being dropped by the worker. There's no flapping because
> it's a one-time event -- one worker is shutting down, decides to drop the
> work, and a rebalance sorts it out and reassigns it to another member of
> the group. This happens once and then the "issue" is resolved without any
> additional interruptions.
>
> -Ewen
>
> On Thu, Jan 5, 2017 at 3:01 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
> > I see... doesn't that cause flapping though?
> >
> > On Wed, Jan 4, 2017 at 8:22 PM, Ewen Cheslack-Postava <e...@confluent.io
> >
> > wrote:
> >
> > > The coordinator will immediately move the group into a rebalance if it
> > > needs it. The reason LeaveGroupRequest was added was to avoid having to
> > > wait for the session timeout before completing a rebalance. So aside
> from
> > > the latency of cleanup/committing offests/rejoining after a heartbeat,
> > > rolling bounces should be fast for consumer groups.
> > >
> > > -Ewen
> > >
> > > On Wed, Jan 4, 2017 at 5:19 PM, Pradeep Gollakota <
> pradeep...@gmail.com>
> > > wrote:
> > >
> > > > Hi Kafka folks!
> > > >
> > > > When a consumer is closed, it will issue a LeaveGroupRequest. Does
> > anyone
> > > > know how long the coordinator waits before reassigning the partitions
> > > that
> > > > were assigned to the leaving consumer to a new consumer? I ask
> because
> > > I'm
> > > > trying to understand the behavior of consumers if you're doing a
> > rolling
> > > > restart.
> > > >
> > > > Thanks!
> > > > Pradeep
> > > >
> > >
> >
>


Consumer Rebalancing Question

2017-01-04 Thread Pradeep Gollakota
Hi Kafka folks!

When a consumer is closed, it will issue a LeaveGroupRequest. Does anyone
know how long the coordinator waits before reassigning the partitions that
were assigned to the leaving consumer to a new consumer? I ask because I'm
trying to understand the behavior of consumers if you're doing a rolling
restart.

Thanks!
Pradeep


Re: Spark Website

2016-07-13 Thread Pradeep Gollakota
Worked for me if I go to https://spark.apache.org/site/ but not
https://spark.apache.org

On Wed, Jul 13, 2016 at 11:48 AM, Maurin Lenglart 
wrote:

> Same here
>
>
>
> *From: *Benjamin Kim 
> *Date: *Wednesday, July 13, 2016 at 11:47 AM
> *To: *manish ranjan 
> *Cc: *user 
> *Subject: *Re: Spark Website
>
>
>
> It takes me to the directories instead of the webpage.
>
>
>
> On Jul 13, 2016, at 11:45 AM, manish ranjan  wrote:
>
>
>
> working for me. What do you mean 'as supposed to'?
>
>
> ~Manish
>
>
>
> On Wed, Jul 13, 2016 at 11:45 AM, Benjamin Kim  wrote:
>
> Has anyone noticed that the spark.apache.org is not working as supposed
> to?
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>


Re: kafka + autoscaling groups fuckery

2016-06-28 Thread Pradeep Gollakota
Just out of curiosity, if you guys are in AWS for everything, why not use
Kinesis?

On Tue, Jun 28, 2016 at 3:49 PM, Charity Majors  wrote:

> Hi there,
>
> I just finished implementing kafka + autoscaling groups in a way that made
> sense to me.  I have a _lot_ of experience with ASGs and various storage
> types but I'm a kafka noob (about 4-5 months of using in development and
> staging and pre-launch production).
>
> It seems to be working fine from the Kafka POV but causing troubling side
> effects elsewhere that I don't understand.  I don't know enough about Kafka
> to know if my implementation is just fundamentally flawed for some reason,
> or if so how and why.
>
> My process is basically this:
>
> - *Terminate a node*, or increment the size of the ASG by one.  (I'm not
> doing any graceful shutdowns because I don't want to rely on graceful
> shutdowns, and I'm not attempting to act upon more than one node at a
> time.  Planning on doing a ZK lock or something later to enforce one
> process at a time, if I can work the major kinks out.)
>
> - *Firstboot script,* which runs on all hosts from rc.init.  (We run ASGs
> for *everything.)  It infers things like the chef role, environment,
> cluster name, etc, registers DNS, bootstraps and runs chef-client, etc.
> For storage nodes, it formats and mounts a PIOPS volume under the right
> mount point, or just remounts the volume if it already contains data.  Etc.
>
> - *Run a balancing script from firstboot* on kafka nodes.  It checks to
> see how many brokers there are and what their ids are, and checks for any
> underbalanced partitions with less than 3 ISRs.  Then we generate a new
> assignment file for rebalancing partitions, and execute it.  We watch on
> the host for all the partitions to finish rebalancing, then complete.
>
> *- So far so good*.  I have repeatedly killed kafka nodes and had them
> come up, rebalance the cluster, and everything on the kafka side looks
> healthy.  All the partitions have the correct number of ISRs, etc.
>
> But after doing this, we have repeatedly gotten into a state where
> consumers that are pulling off the kafka partitions enter a weird state
> where their last known offset is *ahead* of the last known offset for that
> partition, and we can't recover from it.
>
> *A example.*  Last night I terminated ... I think it was broker 1002 or
> 1005, and it came back up as broker 1009.  It rebalanced on boot,
> everything looked good from the kafka side.  This morning we noticed that
> the storage node that maps to partition 5 has been broken for like 22
> hours, it thinks the next offset is too far ahead / out of bounds so
> stopped consuming.  This happened shortly after broker 1009 came online and
> the consumer caught up.
>
> From the storage node log:
>
> time="2016-06-28T21:51:48.286035635Z" level=info msg="Serving at
> 0.0.0.0:8089..."
> time="2016-06-28T21:51:48.293946529Z" level=error msg="Error creating
> consumer" error="kafka server: The requested offset is outside the range of
> offsets maintained by the server for the given topic/partition."
> time="2016-06-28T21:51:48.294532365Z" level=error msg="Failed to start
> services: kafka server: The requested offset is outside the range of
> offsets maintained by the server for the given topic/partition."
> time="2016-06-28T21:51:48.29461156Z" level=info msg="Shutting down..."
>
> From the mysql mapping of partitions to storage nodes/statuses:
>
> PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$
> hound-kennel
>
> Listing by default. Use -action  setstate, addslot, removeslot, removenode> for other actions
>
> PartStatus  Last UpdatedHostname
> 0   live2016-06-28 22:29:10 + UTC   retriever-772045ec
> 1   live2016-06-28 22:29:29 + UTC   retriever-75e0e4f2
> 2   live2016-06-28 22:29:25 + UTC   retriever-78804480
> 3   live2016-06-28 22:30:01 + UTC   retriever-c0da5f85
> 4   live2016-06-28 22:29:42 + UTC   retriever-122c6d8e
> 5   2016-06-28 21:53:48 + UTC
>
>
> PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$
> hound-kennel -partition 5 -action nextoffset
>
> Next offset for partition 5: 12040353
>
>
> Interestingly, the primary for partition 5 is 1004, and its follower is
> the new node 1009.  (Partition 2 has 1009 as its leader and 1004 as its
> follower, and seems just fine.)
>
> I've attached all the kafka logs for the broker 1009 node since it
> launched yesterday.
>
> I guess my main question is: *Is there something I am fundamentally
> missing about the kafka model that makes it it not play well with
> autoscaling?*  I see a couple of other people on the internet talking
> about using ASGs with kafka, but always in the context of maintaining a
> list of broker ids and reusing them.
>
> *I don't want to do that.  I want the path 

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
IIRC, TextInputFormat supports an input path that is a comma-separated
list. I haven't tried this, but I think you should just be able to do
sc.textFile("file1,file2,...")

On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang  wrote:

> I know these workaround, but wouldn't it be more convenient and
> straightforward to use SparkContext#textFiles ?
>
> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra 
> wrote:
>
>> For more than a small number of files, you'd be better off using
>> SparkContext#union instead of RDD#union.  That will avoid building up a
>> lengthy lineage.
>>
>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky 
>> wrote:
>>
>>> Hey Jeff,
>>> Do you mean reading from multiple text files? In that case, as a
>>> workaround, you can use the RDD#union() (or ++) method to concatenate
>>> multiple rdds. For example:
>>>
>>> val lines1 = sc.textFile("file1")
>>> val lines2 = sc.textFile("file2")
>>>
>>> val rdd = lines1 union lines2
>>>
>>> regards,
>>> --Jakob
>>>
>>> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>>>
 Although user can use the hdfs glob syntax to support multiple inputs.
 But sometimes, it is not convenient to do that. Not sure why there's no api
 of SparkContext#textFiles. It should be easy to implement that. I'd love to
 create a ticket and contribute for that if there's no other consideration
 that I don't know.

 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
Looks like what I was suggesting doesn't work. :/

On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Yes, that's what I suggest. TextInputFormat support multiple inputs. So in
> spark side, we just need to provide API to for that.
>
> On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
>> IIRC, TextInputFormat supports an input path that is a comma separated
>> list. I haven't tried this, but I think you should just be able to do
>> sc.textFile("file1,file2,...")
>>
>> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> I know these workaround, but wouldn't it be more convenient and
>>> straightforward to use SparkContext#textFiles ?
>>>
>>> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> For more than a small number of files, you'd be better off using
>>>> SparkContext#union instead of RDD#union.  That will avoid building up a
>>>> lengthy lineage.
>>>>
>>>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky <joder...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Jeff,
>>>>> Do you mean reading from multiple text files? In that case, as a
>>>>> workaround, you can use the RDD#union() (or ++) method to concatenate
>>>>> multiple rdds. For example:
>>>>>
>>>>> val lines1 = sc.textFile("file1")
>>>>> val lines2 = sc.textFile("file2")
>>>>>
>>>>> val rdd = lines1 union lines2
>>>>>
>>>>> regards,
>>>>> --Jakob
>>>>>
>>>>> On 11 November 2015 at 01:20, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>
>>>>>> Although user can use the hdfs glob syntax to support multiple
>>>>>> inputs. But sometimes, it is not convenient to do that. Not sure why
>>>>>> there's no api of SparkContext#textFiles. It should be easy to implement
>>>>>> that. I'd love to create a ticket and contribute for that if there's no
>>>>>> other consideration that I don't know.
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Datacenter to datacenter over the open internet

2015-10-06 Thread Pradeep Gollakota
At Lithium, we have multiple datacenters and we distcp our data across our
Hadoop clusters. We have 2 DCs in NA and 1 in EU. We have a non-redundant
direct connect from our EU cluster to one of our NA DCs. If and when this
fails, we have automatic failover to a VPN that goes over the internet. The
amount of data that's moving across the clusters is not much, so we can get
away with this. We don't have Kafka replication setup yet, but we will be
setting it up using Mirror Maker and the same constraints apply.

Of course opening up your Kafka cluster to be reachable by the internet
would work too, but IMHO a VPN is more secure and reduces the surface area
of your infrastructure that could come under attack. It sucks that you
can't get your executives on board for a p2p direct connect as that is the
best solution.

On Tue, Oct 6, 2015 at 5:48 PM, Gwen Shapira  wrote:

> You can configure "advertised.host.name" for each broker, which is the
> name
> external consumers and producers will use to refer to the brokers.
>
> On Tue, Oct 6, 2015 at 3:31 PM, Tom Brown  wrote:
>
> > Hello,
> >
> > How do you consume a kafka topic from a remote location without a
> dedicated
> > connection? How do you protect the server?
> >
> > The setup: data streams into our datacenter. We process it, and publish
> it
> > to a kafka cluster. The consumer is located in a different datacenter
> with
> > no direct connection. The most efficient scenario would be to setup a
> > point-to-point link but that idea has no traction with our executives. We
> > can setup a VPN; While functional, our IT department assures us that it
> > won't be able to scale.
> >
> > What we're currently planning is to expose the kafka cluster IP addresses
> > to the internet, and only allow access via firewall. Each message will be
> > encrypted with a shared private key, so we're not worried about messages
> > being intercepted. What we are worried about is this: how brokers refer
> to
> > each other-- when a broker directs the consumer to the server that is in
> > charge of a particular region, does it use the host name (that could be
> > externally mapped to the public IP) or does it use the detected/private
> IP
> > address.
> >
> > What solution would you use to consume a remote cluster?
> >
> > --Tom
> >
>


Re: Dealing with large messages

2015-10-06 Thread Pradeep Gollakota
Thanks for the replies!

I was rather hoping not to have to implement a side-channel solution. :/

If we have to do this, we may use an HBase table with a TTL that matches our
topic's retention so the large objects are "gc'ed"... thoughts?

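If we do go that route, a rough sketch of the table setup (HBase 1.x admin API;
table and family names are made up, and the TTL assumes a 7-day topic retention):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class LargePayloadTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {

      HColumnDescriptor family = new HColumnDescriptor("d");
      // Expire cells on the same schedule as the Kafka topic's retention,
      // so the blobs get "gc'ed" along with the references.
      family.setTimeToLive(7 * 24 * 60 * 60);

      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("large_payloads"));
      table.addFamily(family);
      admin.createTable(table);
    }
  }
}
```
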
On Tue, Oct 6, 2015 at 8:45 AM, Gwen Shapira <g...@confluent.io> wrote:

> Storing large blobs in S3 or HDFS and placing URIs in Kafka is the most
> common solution I've seen in use.
>
> On Tue, Oct 6, 2015 at 8:32 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
>
> > The best practice I think is to just put large objects in a blob store
> > and have messages embed references to those blobs. Interestingly we
> > ended up having to implement large-message-support at LinkedIn but for
> > various reasons were forced to put messages inline (i.e., against the
> > above recommendation). So we ended up having to break up large
> > messages into smaller chunks. This obviously adds considerable
> > complexity to the consumer since the checkpointing can become pretty
> > complicated. There are other nuances as well - we can probably do a
> > short talk on this at an upcoming meetup.
> >
> > Joel
> >
> >
> > On Mon, Oct 5, 2015 at 9:31 PM, Rahul Jain <rahul...@gmail.com> wrote:
> > > In addition to the config changes mentioned in that post, you may also
> > have
> > > to change producer config if you are using the new producer.
> > >
> > > Specifically, *max.request.size* and *request.timeout.ms
> > > <http://request.timeout.ms>* have to be increased to allow the
> producer
> > to
> > > send large messages.
> > >
> > >
> > > On 6 Oct 2015 02:02, "James Cheng" <jch...@tivo.com> wrote:
> > >
> > >> Here’s an article that Gwen wrote earlier this year on handling large
> > >> messages in Kafka.
> > >>
> > >> http://ingest.tips/2015/01/21/handling-large-messages-kafka/
> > >>
> > >> -James
> > >>
> > >> > On Oct 5, 2015, at 11:20 AM, Pradeep Gollakota <
> pradeep...@gmail.com>
> > >> wrote:
> > >> >
> > >> > Fellow Kafkaers,
> > >> >
> > >> > We have a pretty heavyweight legacy event logging system for batch
> > >> > processing. We're now sending the events into Kafka now for realtime
> > >> > analytics. But we have some pretty large messages (> 40 MB).
> > >> >
> > >> > I'm wondering if any of you have use cases where you have to send
> > large
> > >> > messages to Kafka and how you're dealing with them.
> > >> >
> > >> > Thanks,
> > >> > Pradeep
> > >>
> > >>
> > >> 
> > >>
> > >> This email and any attachments may contain confidential and privileged
> > >> material for the sole use of the intended recipient. Any review,
> > copying,
> > >> or distribution of this email (or any attachments) by others is
> > prohibited.
> > >> If you are not the intended recipient, please contact the sender
> > >> immediately and permanently delete this email and any attachments. No
> > >> employee or agent of TiVo Inc. is authorized to conclude any binding
> > >> agreement on behalf of TiVo Inc. by email. Binding agreements with
> TiVo
> > >> Inc. may only be made by a signed written agreement.
> > >>
> >
>


Dealing with large messages

2015-10-05 Thread Pradeep Gollakota
Fellow Kafkaers,

We have a pretty heavyweight legacy event logging system for batch
processing. We're now sending the events into Kafka for realtime
analytics. But we have some pretty large messages (> 40 MB).

I'm wondering if any of you have use cases where you have to send large
messages to Kafka and how you're dealing with them.

Thanks,
Pradeep


Re: number of topics given many consumers and groups within the data

2015-09-30 Thread Pradeep Gollakota
To add a little more context to Shaun's question, we have around 400
customers. Each customer has a stream of events. Some customers generate a
lot of data while others don't. We need to ensure that each customer's data
is sorted globally by timestamp.

We have two use cases around consumption:

1. A user may consume an individual customer's data
2. A user may consume data for all customers

Given these two use cases, I think the better strategy is to have a
separate topic per customer as Todd suggested.

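To sketch how use case 2 stays simple with a topic per customer, the consumer
can subscribe to a pattern so new customer topics get picked up automatically
(topic naming, group id, and broker are illustrative; the pattern subscribe on
the 0.9+ Java consumer takes a rebalance listener):

```
import java.util.Collection;
import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AllCustomersConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");
    props.put("group.id", "all-customers-reader");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    // One topic per customer, e.g. "customer-0042"; the pattern covers them all.
    consumer.subscribe(Pattern.compile("customer-.*"), new ConsumerRebalanceListener() {
      @Override public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
      @Override public void onPartitionsAssigned(Collection<TopicPartition> partitions) { }
    });

    while (true) {
      consumer.poll(1000);  // ordering is preserved within each topic's partition
    }
  }
}
```
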
On Wed, Sep 30, 2015 at 9:26 AM, Todd Palino  wrote:

> So I disagree with the idea to use custom partitioning, depending on your
> requirements. Having a consumer consume from a single partition is not
> (currently) that easy. If you don't care which consumer gets which
> partition (group), then it's not that bad. You have 20 partitions, you have
> 20 consumers, and you use custom partitioning as noted. The consumers use
> the high level consumer with a single group, each one will get one
> partition each, and it's pretty straightforward. If a consumer crashes, you
> will end up with two partitions on one of the remaining consumers. If this
> is OK, this is a decent solution.
>
> If, however, you require that each consumer always have the same group of
> data, and you need to know what that group is beforehand, it's more
> difficult. You need to use the simple consumer to do it, which means you
> need to implement a lot of logic for error and status code handling
> yourself, and do it right. In this case, I think your idea of using 400
> separate topics is sound. This way you can still use the high level
> consumer, which takes care of the error handling for you, and your data is
> separated out by topic.
>
> Provided it is not an issue to implement it in your producer, I would go
> with the separate topics. Alternately, if you're not sure you always want
> separate topics, you could go with something similar to your second idea,
> but have a consumer read the single topic and split the data out into 400
> separate topics in Kafka (no need for Cassandra or Redis or anything else).
> Then your real consumers can all consume their separate topics. Reading and
> writing the data one extra time is much better than rereading all of it 400
> times and throwing most of it away.
>
> -Todd
>
>
> On Wed, Sep 30, 2015 at 9:06 AM, Ben Stopford  wrote:
>
> > Hi Shaun
> >
> > You might consider using a custom partition assignment strategy to push
> > your different “groups" to different partitions. This would allow you
> walk
> > the middle ground between "all consumers consume everything” and “one
> topic
> > per consumer” as you vary the number of partitions in the topic, albeit
> at
> > the cost of a little extra complexity.
> >
> > Also, not sure if you’ve seen it but there is quite a good section in the
> > FAQ here <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowmanytopicscanIhave
> ?>
> > on topic and partition sizing.
> >
> > B
> >
> > > On 29 Sep 2015, at 18:48, Shaun Senecal 
> > wrote:
> > >
> > > Hi
> > >
> > >
> > > I heave read Jay Kreps post regarding the number of topics that can be
> > handled by a broker (
> > https://www.quora.com/How-many-topics-can-be-created-in-Apache-Kafka),
> > and it has left me with more questions that I dont see answered anywhere
> > else.
> > >
> > >
> > > We have a data stream which will be consumed by many consumers (~400).
> > We also have many "groups" within our data.  A group in the data
> > corresponds 1:1 with what the consumers would consume, so consumer A only
> > ever see group A messages, consumer B only consumes group B messages,
> etc.
> > >
> > >
> > > The downstream consumers will be consuming via a websocket API, so the
> > API server will be the thing consuming from kafka.
> > >
> > >
> > > If I use a single topic with, say, 20 partitions, the consumers in the
> > API server would need to re-read the same messages over and over for each
> > consumer, which seems like a waste of network and a potential bottleneck.
> > >
> > >
> > > Alternatively, I could use a single topic with 20 partitions and have a
> > single consumer in the API put the messages into cassandra/redis (as
> > suggested by Jay), and serve out the downstream consumer streams that
> way.
> > However, that requires using a secondary sorted storage, which seems
> like a
> > waste (and added complexity) given that Kafka already has the data
> exactly
> > as I need it.  Especially if cassandra/redis are required to maintain a
> > long TTL on the stream.
> > >
> > >
> > > Finally, I could use 1 topic per group, each with a single partition.
> > This would result in 400 topics on the broker, but would allow the API
> > server to simply serve the stream for each consumer directly from kafka
> and
> > wont require additional machinery to serve out the requests.
> > >
> > >
> > > The 400 topic solution makes the 

CombineHiveInputFormat not working

2015-09-30 Thread Pradeep Gollakota
Hi all,

I have an external table with the following DDL.

```
DROP TABLE IF EXISTS raw_events;
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (
raw_event_string string)
PARTITIONED BY (dc string, community string, dt string)
STORED AS TEXTFILE
LOCATION '/lithium/events/{dc}/{community}/events/{year}/{month}/{day}'
```

The files are loaded externally and are LZ4 compressed. When I run a query
on this table for a single day, I'm getting 1 mapper per file even though
the input format is set to CombineHiveInputFormat.

Does anyone know if CombineHiveInputFormat does not work with LZ4
compressed files or have any idea why split combination is not working?

Thanks!
Pradeep


Re: CombineHiveInputFormat not working

2015-09-30 Thread Pradeep Gollakota
mapred.min.split.size = mapreduce.input.fileinputformat.split.minsize = 1
mapred.max.split.size = mapreduce.input.fileinputformat.split.maxsize =
134217728
hive.hadoop.supports.splittable.combineinputformat = false

My average file size is pretty small... it's usually between 500K and 20MB.

So it looks like the splittable support is turned off? I've been seeing
some posts on the mailing list saying there are correctness problems when
using this with LZO.

Is this still the case? Can I turn this on with LZ4?

Thanks!

On Wed, Sep 30, 2015 at 1:38 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
wrote:

> Also...
>
> mapreduce.input.fileinputformat.split.maxsize
>
>
>
> and, what is the size of your input files?
>
>
>
> *From:* Ryan Harris
> *Sent:* Wednesday, September 30, 2015 2:37 PM
> *To:* 'user@hive.apache.org'
> *Subject:* RE: CombineHiveInputFormat not working
>
>
>
> what are your values for:
>
> mapred.min.split.size
>
> mapred.max.split.size
>
> hive.hadoop.supports.splittable.combineinputformat
>
>
>
>
>
> *From:* Pradeep Gollakota [mailto:pradeep...@gmail.com]
> *Sent:* Wednesday, September 30, 2015 2:20 PM
> *To:* user@hive.apache.org
> *Subject:* CombineHiveInputFormat not working
>
>
>
> Hi all,
>
>
>
> I have an external table with the following DDL.
>
>
>
> ```
>
> DROP TABLE IF EXISTS raw_events;
>
> CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (
>
> raw_event_string string)
>
> PARTITIONED BY (dc string, community string, dt string)
>
> STORED AS TEXTFILE
>
> LOCATION '/lithium/events/{dc}/{community}/events/{year}/{month}/{day}'
>
> ```
>
>
>
> The files are loaded externally and are LZ4 compressed. When I run a query
> on this table for a single day, I'm getting 1 mapper per file even though
> the input format is set to CombineHiveInputFormat.
>
>
>
> Does anyone know if CombineHiveInputFormat does not work with LZ4
> compressed files or have any idea why split combination is not working?
>
>
>
> Thanks!
>
> Pradeep
>


Re: CombineHiveInputFormat not working

2015-09-30 Thread Pradeep Gollakota
I'm running with CDH 5.3.3 (Hadoop 2.5.0 + CDH patches)... so those two
issues hopefully don't apply. I'll try the two configs suggested and
report back.

Thanks!

On Wed, Sep 30, 2015 at 3:14 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
wrote:

> I would suggest trying:
>
> set hive.hadoop.supports.splittable.combineinputformat = true;
>
>
>
> you might also need to increase mapreduce.input.fileinputformat.split.minsize
> to something larger, like 32MB
>
> set mapreduce.input.fileinputformat.split.minsize = 33554432;
>
>
>
> Depending on your hadoop distro and version, be potentially aware of
>
> https://issues.apache.org/jira/browse/MAPREDUCE-1597
>
> and
>
> https://issues.apache.org/jira/browse/MAPREDUCE-5537
>
>
>
> test it and see...
>
>
>
> *From:* Pradeep Gollakota [mailto:pradeep...@gmail.com]
> *Sent:* Wednesday, September 30, 2015 3:33 PM
> *To:* user@hive.apache.org
> *Subject:* Re: CombineHiveInputFormat not working
>
>
>
> mapred.min.split.size = mapreduce.input.fileinputformat.split.minsize = 1
> mapred.max.split.size = mapreduce.input.fileinputformat.split.maxsize =
> 134217728
> hive.hadoop.supports.splittable.combineinputformat = false
>
>
>
> My average file size is pretty small... it's usually between 500K and 20MB.
>
>
>
> So it looks like the splittable support is turned off? I've been seeing
> some posts on the mailing list saying there's correctness problems when
> using this and LZO.
>
>
>
> Is this still the case? Can I turn this on with LZ4?
>
>
>
> Thanks!
>
>
>
> On Wed, Sep 30, 2015 at 1:38 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
> wrote:
>
> Also...
>
> mapreduce.input.fileinputformat.split.maxsize
>
>
>
> and, what is the size of your input files?
>
>
>
> *From:* Ryan Harris
> *Sent:* Wednesday, September 30, 2015 2:37 PM
> *To:* 'user@hive.apache.org'
> *Subject:* RE: CombineHiveInputFormat not working
>
>
>
> what are your values for:
>
> mapred.min.split.size
>
> mapred.max.split.size
>
> hive.hadoop.supports.splittable.combineinputformat
>
>
>
>
>
> *From:* Pradeep Gollakota [mailto:pradeep...@gmail.com]
> *Sent:* Wednesday, September 30, 2015 2:20 PM
> *To:* user@hive.apache.org
> *Subject:* CombineHiveInputFormat not working
>
>
>
> Hi all,
>
>
>
> I have an external table with the following DDL.
>
>
>
> ```
>
> DROP TABLE IF EXISTS raw_events;
>
> CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (
>
> raw_event_string string)
>
> PARTITIONED BY (dc string, community string, dt string)
>
> STORED AS TEXTFILE
>
> LOCATION '/lithium/events/{dc}/{community}/events/{year}/{month}/{day}'
>
> ```
>
>
>
> The files are loaded externally and are LZ4 compressed. When I run a query
> on this table for a single day, I'm getting 1 mapper per file even though
> the input format is set to CombineHiveInputFormat.
>
>
>
> Does anyone know if CombineHiveInputFormat does not work with LZ4
> compressed files or have any idea why split combination is not working?
>
>
>
> Thanks!
>
> Pradeep
>


Re: Very slow dynamic partition load

2015-06-11 Thread Pradeep Gollakota
Hmm... did your performance increase with the patch you supplied? I do need
the partitions in Hive, but I have a separate tool that has the ability to
add partitions to the metastore and is definitely much faster than this. I
just checked my job again; the actual Hive job completed 24 hours ago and
has been adding the dynamic partitions to the metastore since then, and it is
still not done. According to the metastore, there are only 10830 partitions
added so far... at this pace, it will take approximately 2 more days for it
to complete.
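
For reference, the separate tool mentioned above essentially batches the metastore calls itself. Below is a rough sketch of that idea, assuming direct access to the Hive metastore through HiveMetaStoreClient; the database/table names, partition layout, and the way the partition specs are discovered are all illustrative.

```
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
import org.apache.hadoop.hive.metastore.api.Table;

public class BulkAddPartitions {
    public static void main(String[] args) throws Exception {
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        Table table = client.getTable("default", "events"); // illustrative db/table

        List<Partition> batch = new ArrayList<Partition>();
        for (String[] spec : loadPartitionSpecs()) {         // e.g. {customer, date}
            // Clone the table's storage descriptor and point it at the partition dir.
            StorageDescriptor sd = new StorageDescriptor(table.getSd());
            sd.setLocation(table.getSd().getLocation()
                    + "/customer=" + spec[0] + "/date=" + spec[1]);

            Partition p = new Partition();
            p.setDbName(table.getDbName());
            p.setTableName(table.getTableName());
            p.setValues(Arrays.asList(spec));                // values in partition-column order
            p.setSd(sd);
            batch.add(p);
        }
        // One RPC for the whole batch instead of one create call per partition.
        client.add_partitions(batch);
        client.close();
    }

    private static List<String[]> loadPartitionSpecs() {
        // Placeholder: in practice these would come from the job's output directories.
        return Collections.singletonList(new String[] {"acme", "2015-06-11"});
    }
}
```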

On Thu, Jun 11, 2015 at 1:18 PM, Slava Markeyev slava.marke...@upsight.com
wrote:

 This is something that a few of us have run into. I think the bottleneck
 is in partition creation calls to the metastore. My work around was
 HIVE-10385 which optionally removed partition creation in the metastore but
 this isn't a solution for everyone. If you don't require actual partitions
 in the table but simply partitioned data in hdfs give it a shot. It may be
 worthwhile looking into optimizations for this use case.

 -Slava

 On Thu, Jun 11, 2015 at 11:56 AM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 Hi All,

 I have a table which is partitioned on two columns (customer, date). I'm
 loading some data into the table using a Hive query. The MapReduce job
 completed within a few minutes and needs to commit the data to the
 appropriate partitions. There were about 32000 partitions generated. The
 commit phase has been running for almost 16 hours and has not finished yet.
 I've been monitoring jmap, and don't believe it's a memory or gc issue.
 I've also been looking at jstack and not sure why it's so slow. I'm not
 sure what the problem is, but seems to be a Hive performance issue when it
 comes to highly partitioned tables.

 Any thoughts on this issue would be greatly appreciated.

 Thanks in advance,
 Pradeep




 --

 Slava Markeyev | Engineering | Upsight

 Find me on LinkedIn http://www.linkedin.com/in/slavamarkeyev
 http://www.linkedin.com/in/slavamarkeyev



Re: Very slow dynamic partition load

2015-06-11 Thread Pradeep Gollakota
I actually decided to remove one of my 2 partition columns and make it a
bucketing column instead... the same query completed fully in under 10 minutes
with 92 partitions added. This will suffice for me for now.

On Thu, Jun 11, 2015 at 2:25 PM, Pradeep Gollakota pradeep...@gmail.com
wrote:

 Hmm... did your performance increase with the patch you supplied? I do
 need the partitions in Hive, but I have a separate tool that has the
 ability to add partitions to the metastore and is definitely much faster
 than this. I just checked my job again, the actual Hive job completed 24
 hours ago and has been adding the dynamic partitions to the metastore since
 then and is still not done. According to the metastore, there are only 10830
 partitions added so far... at this pace, it will take approximately 2 more
 days for it to complete.

 On Thu, Jun 11, 2015 at 1:18 PM, Slava Markeyev 
 slava.marke...@upsight.com wrote:

 This is something that a few of us have run into. I think the bottleneck
 is in partition creation calls to the metastore. My work around was
 HIVE-10385 which optionally removed partition creation in the metastore but
 this isn't a solution for everyone. If you don't require actual partitions
 in the table but simply partitioned data in hdfs give it a shot. It may be
 worthwhile looking into optimizations for this use case.

 -Slava

 On Thu, Jun 11, 2015 at 11:56 AM, Pradeep Gollakota pradeep...@gmail.com
  wrote:

 Hi All,

 I have a table which is partitioned on two columns (customer, date). I'm
 loading some data into the table using a Hive query. The MapReduce job
 completed within a few minutes and needs to commit the data to the
 appropriate partitions. There were about 32000 partitions generated. The
 commit phase has been running for almost 16 hours and has not finished yet.
 I've been monitoring jmap, and don't believe it's a memory or gc issue.
 I've also been looking at jstack and not sure why it's so slow. I'm not
 sure what the problem is, but seems to be a Hive performance issue when it
 comes to highly partitioned tables.

 Any thoughts on this issue would be greatly appreciated.

 Thanks in advance,
 Pradeep




 --

 Slava Markeyev | Engineering | Upsight

 Find me on LinkedIn http://www.linkedin.com/in/slavamarkeyev
 http://www.linkedin.com/in/slavamarkeyev





Very slow dynamic partition load

2015-06-11 Thread Pradeep Gollakota
Hi All,

I have a table which is partitioned on two columns (customer, date). I'm
loading some data into the table using a Hive query. The MapReduce job
completed within a few minutes and needs to commit the data to the
appropriate partitions. There were about 32000 partitions generated. The
commit phase has been running for almost 16 hours and has not finished yet.
I've been monitoring jmap, and don't believe it's a memory or gc issue.
I've also been looking at jstack and am not sure why it's so slow. I'm not
sure what the problem is, but it seems to be a Hive performance issue when it
comes to highly partitioned tables.

Any thoughts on this issue would be greatly appreciated.

Thanks in advance,
Pradeep


Re: HCatInputFormat combine splits

2015-05-14 Thread Pradeep Gollakota
Setting the following property has had no effect.

mapreduce.input.fileinputformat.split.maxsize = 67108864

I'm still getting 1 Mapper per file.
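
For completeness, this is roughly how the job and the split-size hints are wired up in the MR driver; the property values and table names are illustrative. Note that plain FileInputFormat-style splitting never merges multiple files into one split, which is why many small files still yield one mapper each unless a CombineFileInputFormat-style wrapper sits in front of the underlying format.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatReadJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Split-size hints, as attempted above; they bound how a single large
        // file is split, but they do not combine many small files by themselves.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 64L * 1024 * 1024);
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "hcat-read");           // illustrative job name
        HCatInputFormat.setInput(job, "default", "raw_events"); // illustrative db/table
        job.setInputFormatClass(HCatInputFormat.class);
        job.setJarByClass(HCatReadJob.class);
        // ... mapper, output format, and the rest of the job setup omitted ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```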

On Thu, May 14, 2015 at 10:27 AM, Ankit Bhatnagar ank...@yahoo-inc.com
wrote:

 you can explicitly set the split size



   On Wednesday, May 13, 2015 11:37 PM, Pradeep Gollakota 
 pradeep...@gmail.com wrote:


 Hi All,

 I'm writing an MR job to read data using HCatInputFormat... however, the
 job is generating too many splits. I don't have this problem when running
 queries in Hive since it combines splits by default.

 Is there an equivalent in MR so that I'm not generating thousands of
 mappers?

 Thanks,
 Pradeep





HCatInputFormat combine splits

2015-05-14 Thread Pradeep Gollakota
Hi All,

I'm writing an MR job to read data using HCatInputFormat... however, the
job is generating too many splits. I don't have this problem when running
queries in Hive since it combines splits by default.

Is there an equivalent in MR so that I'm not generating thousands of
mappers?

Thanks,
Pradeep


Re: How to stop a mapreduce job from terminal running on Hadoop Cluster?

2015-04-12 Thread Pradeep Gollakota
Also, mapred job -kill job_id

On Sun, Apr 12, 2015 at 11:07 AM, Shahab Yunus shahab.yu...@gmail.com
wrote:

 You can kill t by using the following yarn command

 yarn application -kill application id

 https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html

 Or use old hadoop job command
 http://stackoverflow.com/questions/11458519/how-to-kill-hadoop-jobs

 Regards,
 Shahab

 On Sun, Apr 12, 2015 at 2:03 PM, Answer Agrawal yrsna.tse...@gmail.com
 wrote:

 To run a job we use the command
 $ hadoop jar example.jar inputpath outputpath
 If job is so time taken and we want to stop it in middle then which
 command is used? Or is there any other way to do that?

 Thanks,






Re: integrate Camus and Hive?

2015-03-09 Thread Pradeep Gollakota
If I understood your question correctly, you want to be able to read the
output of Camus in Hive and to know the partition values. If my
understanding is right, you can do so as follows.

Hive provides the ability to define custom patterns for partitions. You
can use this in combination with MSCK REPAIR TABLE to automatically detect
and load the partitions into the metastore.

Take a look at this SO
http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern

Does that help?


On Mon, Mar 9, 2015 at 1:42 PM, Yang tedd...@gmail.com wrote:

 I believe many users like us would export the output from camus as a hive
 external table. but the dir structure of camus is like
 //MM/DD/xx

 while hive generally expects /year=/month=MM/day=DD/xx if you
 define that table to be
 partitioned by (year, month, day). otherwise you'd have to add those
 partitions created by camus through a separate command. but in the latter
 case, would a camus job create 1 partitions ? how would we find out the
 /MM/DD values from outside ?  well you could always do something by
 hadoop dfs -ls and then grep the output, but it's kind of not clean


 thanks
 yang



Re: JIRA attack!

2015-02-08 Thread Pradeep Gollakota
Apparently I joined this list at the right time :P

On Sat, Feb 7, 2015 at 4:40 PM, Jay Kreps jay.kr...@gmail.com wrote:

 I closed about 350 redundant or obsolete issues. If I closed an issue you
 think is not obsolete, my apologies, just reopen.

 -Jay



[jira] [Commented] (KAFKA-1884) New Producer blocks forever for Invalid topic names

2015-02-06 Thread Pradeep Gollakota (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310241#comment-14310241
 ] 

Pradeep Gollakota commented on KAFKA-1884:
--

[~guozhang] That's what I figured at first. But the odd behavior is that the
exception storm is happening on the server even after the producer has been shut
down (and the broker restarted). Not sure why that would be the case.
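
As an aside, a client-side guard that mirrors the broker's validation rule quoted below (ASCII alphanumerics, '.', '_' and '-') can at least fail fast while the producer-side fix is worked out. This is just an illustrative helper, not part of the client API:

```
import java.util.regex.Pattern;

public final class TopicNames {
    // Mirrors the rule in kafka.common.Topic.validate: ASCII alphanumerics,
    // '.', '_' and '-' only (length limits and reserved names not shown here).
    private static final Pattern LEGAL = Pattern.compile("[a-zA-Z0-9._-]+");

    private TopicNames() {}

    public static void validate(String topic) {
        if (topic == null || topic.isEmpty() || !LEGAL.matcher(topic).matches()) {
            throw new IllegalArgumentException("Illegal topic name '" + topic
                    + "': only ASCII alphanumerics, '.', '_' and '-' are allowed");
        }
    }
}
```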

 New Producer blocks forever for Invalid topic names
 ---

 Key: KAFKA-1884
 URL: https://issues.apache.org/jira/browse/KAFKA-1884
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Manikumar Reddy
 Fix For: 0.8.3


 New producer blocks forever for invalid topics names
 producer logs:
 DEBUG [2015-01-20 12:46:13,406] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,406] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50845,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,416] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50845.
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50846,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,417] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50846.
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,418] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50847,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,418] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50847.
 Broker logs:
 [2015-01-20 12:46:14,074] ERROR [KafkaApi-0] error when handling request 
 Name: TopicMetadataRequest; Version: 0; CorrelationId: 51020; ClientId: 
 my-producer; Topics: TOPIC= (kafka.server.KafkaApis)
 kafka.common.InvalidTopicException: topic name TOPIC= is illegal, contains a 
 character other than ASCII alphanumerics, '.', '_' and '-'
   at kafka.common.Topic$.validate(Topic.scala:42)
   at 
 kafka.admin.AdminUtils$.createOrUpdateTopicPartitionAssignmentPathInZK(AdminUtils.scala:186)
   at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:177)
   at kafka.server.KafkaApis$$anonfun$5.apply(KafkaApis.scala:367)
   at kafka.server.KafkaApis$$anonfun$5.apply(KafkaApis.scala:350)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at 
 scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
   at scala.collection.SetLike$class.map(SetLike.scala:93)
   at scala.collection.AbstractSet.map(Set.scala:47)
   at kafka.server.KafkaApis.getTopicMetadata(KafkaApis.scala:350)
   at 
 kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:389)
   at kafka.server.KafkaApis.handle(KafkaApis.scala:57)
   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1884) New Producer blocks forever for Invalid topic names

2015-02-06 Thread Pradeep Gollakota (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310548#comment-14310548
 ] 

Pradeep Gollakota commented on KAFKA-1884:
--

I guess that makes sense... I'll confirm.

 New Producer blocks forever for Invalid topic names
 ---

 Key: KAFKA-1884
 URL: https://issues.apache.org/jira/browse/KAFKA-1884
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Manikumar Reddy
 Fix For: 0.8.3


 New producer blocks forever for invalid topics names
 producer logs:
 DEBUG [2015-01-20 12:46:13,406] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,406] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50845,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,416] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50845.
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50846,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,417] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50846.
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,418] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50847,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,418] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50847.
 Broker logs:
 [2015-01-20 12:46:14,074] ERROR [KafkaApi-0] error when handling request 
 Name: TopicMetadataRequest; Version: 0; CorrelationId: 51020; ClientId: 
 my-producer; Topics: TOPIC= (kafka.server.KafkaApis)
 kafka.common.InvalidTopicException: topic name TOPIC= is illegal, contains a 
 character other than ASCII alphanumerics, '.', '_' and '-'
   at kafka.common.Topic$.validate(Topic.scala:42)
   at 
 kafka.admin.AdminUtils$.createOrUpdateTopicPartitionAssignmentPathInZK(AdminUtils.scala:186)
   at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:177)
   at kafka.server.KafkaApis$$anonfun$5.apply(KafkaApis.scala:367)
   at kafka.server.KafkaApis$$anonfun$5.apply(KafkaApis.scala:350)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at 
 scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
   at scala.collection.SetLike$class.map(SetLike.scala:93)
   at scala.collection.AbstractSet.map(Set.scala:47)
   at kafka.server.KafkaApis.getTopicMetadata(KafkaApis.scala:350)
   at 
 kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:389)
   at kafka.server.KafkaApis.handle(KafkaApis.scala:57)
   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1884) New Producer blocks forever for Invalid topic names

2015-02-05 Thread Pradeep Gollakota (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308670#comment-14308670
 ] 

Pradeep Gollakota commented on KAFKA-1884:
--

What makes the behavior in #2 above even more odd is that I stopped the server,
deleted the znodes, deleted the Kafka log dir, and restarted the server, and the
same behavior is still seen.

O.o

 New Producer blocks forever for Invalid topic names
 ---

 Key: KAFKA-1884
 URL: https://issues.apache.org/jira/browse/KAFKA-1884
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Manikumar Reddy
 Fix For: 0.8.3


 New producer blocks forever for invalid topics names
 producer logs:
 DEBUG [2015-01-20 12:46:13,406] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,406] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50845,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,416] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50845.
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50846,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,417] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50846.
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,418] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50847,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,418] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50847.
 Broker logs:
 [2015-01-20 12:46:14,074] ERROR [KafkaApi-0] error when handling request 
 Name: TopicMetadataRequest; Version: 0; CorrelationId: 51020; ClientId: 
 my-producer; Topics: TOPIC= (kafka.server.KafkaApis)
 kafka.common.InvalidTopicException: topic name TOPIC= is illegal, contains a 
 character other than ASCII alphanumerics, '.', '_' and '-'
   at kafka.common.Topic$.validate(Topic.scala:42)
   at 
 kafka.admin.AdminUtils$.createOrUpdateTopicPartitionAssignmentPathInZK(AdminUtils.scala:186)
   at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:177)
   at kafka.server.KafkaApis$$anonfun$5.apply(KafkaApis.scala:367)
   at kafka.server.KafkaApis$$anonfun$5.apply(KafkaApis.scala:350)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at 
 scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
   at scala.collection.SetLike$class.map(SetLike.scala:93)
   at scala.collection.AbstractSet.map(Set.scala:47)
   at kafka.server.KafkaApis.getTopicMetadata(KafkaApis.scala:350)
   at 
 kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:389)
   at kafka.server.KafkaApis.handle(KafkaApis.scala:57)
   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1884) New Producer blocks forever for Invalid topic names

2015-02-05 Thread Pradeep Gollakota (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308539#comment-14308539
 ] 

Pradeep Gollakota commented on KAFKA-1884:
--

I'd like to work on this. Please assign to me.

I've been able to reproduce the issue. I also noticed another oddity about this 
though.

1. The server-side error above is being repeated hundreds of times a second (each
repeat increments the CorrelationId). This seems to indicate some type of retry
logic.
2. If I kill the server, kill the client, and restart the server, the error
continues to repeat. This seems to indicate that this request may be persisted
somewhere.

I have a good grasp of where to start looking for the problem, though I have no 
idea why the above two are occurring.

 New Producer blocks forever for Invalid topic names
 ---

 Key: KAFKA-1884
 URL: https://issues.apache.org/jira/browse/KAFKA-1884
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Manikumar Reddy
 Fix For: 0.8.3


 New producer blocks forever for invalid topics names
 producer logs:
 DEBUG [2015-01-20 12:46:13,406] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,406] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50845,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,416] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50845.
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50846,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,417] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50846.
 DEBUG [2015-01-20 12:46:13,417] NetworkClient: maybeUpdateMetadata(): Trying 
 to send metadata request to node -1
 DEBUG [2015-01-20 12:46:13,418] NetworkClient: maybeUpdateMetadata(): Sending 
 metadata request ClientRequest(expectResponse=true, payload=null, 
 request=RequestSend(header={api_key=3,api_version=0,correlation_id=50847,client_id=my-producer},
  body={topics=[TOPIC=]})) to node -1
 TRACE [2015-01-20 12:46:13,418] NetworkClient: handleMetadataResponse(): 
 Ignoring empty metadata response with correlation id 50847.
 Broker logs:
 [2015-01-20 12:46:14,074] ERROR [KafkaApi-0] error when handling request 
 Name: TopicMetadataRequest; Version: 0; CorrelationId: 51020; ClientId: 
 my-producer; Topics: TOPIC= (kafka.server.KafkaApis)
 kafka.common.InvalidTopicException: topic name TOPIC= is illegal, contains a 
 character other than ASCII alphanumerics, '.', '_' and '-'
   at kafka.common.Topic$.validate(Topic.scala:42)
   at 
 kafka.admin.AdminUtils$.createOrUpdateTopicPartitionAssignmentPathInZK(AdminUtils.scala:186)
   at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:177)
   at kafka.server.KafkaApis$$anonfun$5.apply(KafkaApis.scala:367)
   at kafka.server.KafkaApis$$anonfun$5.apply(KafkaApis.scala:350)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at 
 scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
   at scala.collection.SetLike$class.map(SetLike.scala:93)
   at scala.collection.AbstractSet.map(Set.scala:47)
   at kafka.server.KafkaApis.getTopicMetadata(KafkaApis.scala:350)
   at 
 kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:389)
   at kafka.server.KafkaApis.handle(KafkaApis.scala:57)
   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [kafka-clients] Re: [VOTE] 0.8.2.0 Candidate 3

2015-02-03 Thread Pradeep Gollakota
Lithium Technologies would love to host you guys for a release party in SF
if you guys want.

:)

On Tue, Feb 3, 2015 at 11:04 AM, Gwen Shapira gshap...@cloudera.com wrote:

 When's the party?
 :)

 On Mon, Feb 2, 2015 at 8:13 PM, Jay Kreps jay.kr...@gmail.com wrote:
  Yay!
 
  -Jay
 
  On Mon, Feb 2, 2015 at 2:23 PM, Neha Narkhede n...@confluent.io wrote:
 
  Great! Thanks Jun for helping with the release and everyone involved for
  your contributions.
 
  On Mon, Feb 2, 2015 at 1:32 PM, Joe Stein joe.st...@stealth.ly wrote:
 
   Huzzah!
  
   Thanks Jun for preparing the release candidates and getting this out
 to
   the
   community.
  
   - Joe Stein
  
   On Mon, Feb 2, 2015 at 2:27 PM, Jun Rao j...@confluent.io wrote:
  
The following are the results of the votes.
   
+1 binding = 3 votes
+1 non-binding = 1 votes
-1 = 0 votes
0 = 0 votes
   
The vote passes.
   
I will release artifacts to maven central, update the dist svn and
   download
site. Will send out an announce after that.
   
Thanks everyone that contributed to the work in 0.8.2.0!
   
Jun
   
On Wed, Jan 28, 2015 at 9:22 PM, Jun Rao j...@confluent.io wrote:
   
This is the third candidate for release of Apache Kafka 0.8.2.0.
   
Release Notes for the 0.8.2.0 release
   
   
  
  
 https://people.apache.org/~junrao/kafka-0.8.2.0-candidate3/RELEASE_NOTES.html
   
*** Please download, test and vote by Saturday, Jan 31, 11:30pm PT
   
Kafka's KEYS file containing PGP keys we use to sign the release:
http://kafka.apache.org/KEYS in addition to the md5, sha1 and sha2
(SHA256) checksum.
   
* Release artifacts to be voted upon (source and binary):
https://people.apache.org/~junrao/kafka-0.8.2.0-candidate3/
   
* Maven artifacts to be voted upon prior to release:
https://repository.apache.org/content/groups/staging/
   
* scala-doc
   
 https://people.apache.org/~junrao/kafka-0.8.2.0-candidate3/scaladoc/
   
* java-doc
   
 https://people.apache.org/~junrao/kafka-0.8.2.0-candidate3/javadoc/
   
* The tag to be voted upon (off the 0.8.2 branch) is the 0.8.2.0
 tag
   
   
  
  
 https://git-wip-us.apache.org/repos/asf?p=kafka.git;a=tag;h=223ac42a7a2a0dab378cc411f4938a9cea1eb7ea
(commit 7130da90a9ee9e6fb4beb2a2a6ab05c06c9bfac4)
   
/***
   
Thanks,
   
Jun
   
   
 --
You received this message because you are subscribed to the Google
Groups
kafka-clients group.
To unsubscribe from this group and stop receiving emails from it,
 send
an
email to kafka-clients+unsubscr...@googlegroups.com.
To post to this group, send email to kafka-clie...@googlegroups.com
 .
Visit this group at http://groups.google.com/group/kafka-clients.
To view this discussion on the web visit
   
  
  
 https://groups.google.com/d/msgid/kafka-clients/CAFc58G-XRpw9ik35%2BCmsYm-uc2hjHet6fOpw4bF90ka9Z%3DhH%3Dw%40mail.gmail.com

  
  
 https://groups.google.com/d/msgid/kafka-clients/CAFc58G-XRpw9ik35%2BCmsYm-uc2hjHet6fOpw4bF90ka9Z%3DhH%3Dw%40mail.gmail.com?utm_medium=emailutm_source=footer
   
.
   
For more options, visit https://groups.google.com/d/optout.
   
  
 
 
 
  --
  Thanks,
  Neha
 
 
  --
  You received this message because you are subscribed to the Google Groups
  kafka-clients group.
  To unsubscribe from this group and stop receiving emails from it, send an
  email to kafka-clients+unsubscr...@googlegroups.com.
  To post to this group, send email to kafka-clie...@googlegroups.com.
  Visit this group at http://groups.google.com/group/kafka-clients.
  To view this discussion on the web visit
 
 https://groups.google.com/d/msgid/kafka-clients/CAOeJiJjkYXyK_3qxJYpchG%2B_-c1Jt6K_skT_1geP%3DEJXV5w9uQ%40mail.gmail.com
 .
 
  For more options, visit https://groups.google.com/d/optout.



Re: Hive - regexp_replace function for multiple strings

2015-02-03 Thread Pradeep Gollakota
I don't think this is doable using the out-of-the-box regexp_replace() UDF.
The way I would do it is to use a file that maps each regexp to its
replacement, and write a custom UDF that loads this file and applies all of
the regular expressions to the input.
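
A rough sketch of that idea, assuming the classic Hive UDF API and a classpath resource with one regexp=replacement pair per line (the file name and format are made up for illustration):

```
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MultiRegexpReplace extends UDF {
    // Pattern -> replacement, loaded once per UDF instance from a resource
    // packaged inside the UDF jar (e.g. "hip hop=hiphop", "rock music=rockmusic").
    private final Map<Pattern, String> replacements = new LinkedHashMap<>();

    public MultiRegexpReplace() throws Exception {
        try (InputStream in = getClass().getResourceAsStream("/replacements.txt");
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int sep = line.indexOf('=');
                if (sep > 0) {
                    replacements.put(Pattern.compile(line.substring(0, sep)),
                            line.substring(sep + 1));
                }
            }
        }
    }

    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        String result = input.toString();
        for (Map.Entry<Pattern, String> e : replacements.entrySet()) {
            result = e.getKey().matcher(result).replaceAll(e.getValue());
        }
        return new Text(result);
    }
}
```

After ADD JAR and CREATE TEMPORARY FUNCTION, a single call to the function replaces the chain of nested regexp_replace() calls.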

Hope this helps.

On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh viral.j.par...@gmail.com
wrote:

 Hi Everyone,

 I am using hive 0.13! I want to find multiple tokens like hip hop and
 rock music in my data and replace them with hiphop and rockmusic -
 basically replace them without white space. I have used the regexp_replace
 function in hive. Below is my query and it works great for above 2 examples.

 drop table vp_hiphop;
 create table vp_hiphop as
 select userid, ntext,
   regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1
 from vp_nlp_protext_males;

 But I have 100 such bigrams/ngrams and want to be able to do replace
 efficiently where I just remove the whitespace. I can pattern match the
 phrase - hip hop and rock music but in the replace I want to simply trim
 the white spaces. Below is what I tried. I also tried using trim with
 regexp_replace but it wants the third argument in the regexp_replace
 function.

 drop table vp_hiphop;
 create table vp_hiphop as
 select userid, ntext,
   regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
 from vp_nlp_protext_males;




Re: New Producer - ONLY sync mode?

2015-02-02 Thread Pradeep Gollakota
This is a great question Otis. Like Gwen said, you can accomplish Sync mode
by setting the batch size to 1. But this does highlight a shortcoming of
the new producer API.

I really like the design of the new API and it has really great properties
and I'm enjoying working with it. However, one API that I think we're
lacking is a batch API. Currently, I have to iterate over a batch and
call .send() on each record, which returns n callbacks instead of 1
callback for the whole batch. This significantly complicates recovery logic
where we need to commit a batch as opposed to 1 record at a time.

Do you guys have any plans to add better semantics around batches?
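
To make the two points above concrete, here is a small sketch against the new producer API: blocking on the returned future gives per-record sync behavior, and a batch today has to be tracked as a list of futures by the caller (class and method names here are made up for illustration).

```
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class BatchSendSketch {
    /** Synchronous single send: block until the broker acks (or the future throws). */
    public static RecordMetadata sendSync(Producer<String, byte[]> producer,
                                          ProducerRecord<String, byte[]> record) throws Exception {
        return producer.send(record).get();
    }

    /** "Batch" send today: n sends, n futures, and the caller stitches them together. */
    public static void sendBatch(Producer<String, byte[]> producer,
                                 List<ProducerRecord<String, byte[]>> batch) throws Exception {
        List<Future<RecordMetadata>> futures = new ArrayList<>(batch.size());
        for (ProducerRecord<String, byte[]> record : batch) {
            futures.add(producer.send(record));
        }
        // Only after every future resolves can the whole batch be treated as
        // committed; a single failure leaves the caller to decide how to retry
        // part of the batch, which is the recovery complexity described above.
        for (Future<RecordMetadata> f : futures) {
            f.get();
        }
    }
}
```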

On Mon, Feb 2, 2015 at 1:34 PM, Gwen Shapira gshap...@cloudera.com wrote:

 If I understood the code and Jay correctly - if you wait for the
 future it will be a similar delay to that of the old sync producer.

 Put another way, if you test it out and see longer delays than the
 sync producer had, we need to find out why and fix it.

 Gwen

 On Mon, Feb 2, 2015 at 1:27 PM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Hi,
 
  Nope, unfortunately it can't do that.  X is a remote app, doesn't listen
 to
  anything external, calls Y via HTTPS.  So X has to decide what to do with
  its data based on Y's synchronous response.  It has to block until Y
  responds.  And it wouldn't be pretty, I think, because nobody wants to
 run
  apps that talk to remove servers and hang on to connections more than
 they
  have to.  But perhaps that is the only way?  Or maybe the answer to I'm
  guessing the delay would be more or less the same as if the Producer was
  using SYNC mode? is YES, in which case the connection from X to Y would
 be
  open for just as long as with a SYNC producer running in Y?
 
  Thanks,
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
  Solr  Elasticsearch Support * http://sematext.com/
 
 
  On Mon, Feb 2, 2015 at 4:03 PM, Gwen Shapira gshap...@cloudera.com
 wrote:
 
  Can Y have a callback that will handle the notification to X?
  In this case, perhaps Y can be async and X can buffer the data until
  the callback triggers and says all good (or resend if the callback
  indicates an error)
 
  On Mon, Feb 2, 2015 at 12:56 PM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
   Hi,
  
   Thanks for the info.  Here's the use case.  We have something up
 stream
   sending data, say a log shipper called X.  It sends it to some remote
   component Y.  Y is the Kafka Producer and it puts data into Kafka.
 But Y
   needs to send a reply to X and tell it whether it successfully put all
  its
   data into Kafka.  If it did not, Y wants to tell X to buffer data
 locally
   and resend it later.
  
   If producer is ONLY async, Y can't easily do that.  Or maybe Y would
 just
   need to wait for the Future to come back and only then send the
 response
   back to X?  If so, I'm guessing the delay would be more or less the
 same
  as
   if the Producer was using SYNC mode?
  
   Thanks,
   Otis
   --
   Monitoring * Alerting * Anomaly Detection * Centralized Log Management
   Solr  Elasticsearch Support * http://sematext.com/
  
  
   On Mon, Feb 2, 2015 at 3:13 PM, Jay Kreps jay.kr...@gmail.com
 wrote:
  
   Yeah as Gwen says there is no sync/async mode anymore. There is a new
   configuration which does a lot of what async did in terms of allowing
   batching:
  
   batch.size - This is the target amount of data per partition the
 server
   will attempt to batch together.
   linger.ms - This is the time the producer will wait for more data
 to be
   sent to better batch up writes. The default is 0 (send immediately).
 So
  if
   you set this to 50 ms the client will send immediately if it has
 already
   filled up its batch, otherwise it will wait to accumulate the number
 of
   bytes given by batch.size.
  
   To send asynchronously you do
  producer.send(record)
   whereas to block on a response you do
  producer.send(record).get();
   which will wait for acknowledgement from the server.
  
   One advantage of this model is that the client will do it's best to
  batch
   under the covers even if linger.ms=0. It will do this by batching
 any
  data
   that arrives while another send is in progress into a single
   request--giving a kind of group commit effect.
  
   The hope is that this will be both simpler to understand (a single
 api
  that
   always works the same) and more powerful (you always get a response
 with
   error and offset information whether or not you choose to use it).
  
   -Jay
  
  
   On Mon, Feb 2, 2015 at 11:15 AM, Gwen Shapira gshap...@cloudera.com
 
   wrote:
  
If you want to emulate the old sync producer behavior, you need to
 set
the batch size to 1  (in producer config) and wait on the future
 you
get from Send (i.e. future.get)
   
I can't think of good reasons to do so, though.
   
Gwen
   
   
On Mon, Feb 2, 2015 at 11:08 AM, Otis Gospodnetic

Re: New Producer - ONLY sync mode?

2015-02-02 Thread Pradeep Gollakota
I looked at the newly added batch API to Kinesis for inspiration. The
response on the batch put is a list of message-ids and their status (offset
if success else a failure code).

Ideally, I think the server should fail the entire batch or succeed the
entire batch (i.e. no duplicates), but this is pretty hard to implement.
Given that, what Kinesis did is probably a good compromise (perhaps while we
wait for exactly once semantics :))

In addition, perhaps adding a flush() method to the producer to allow for
control over when flushes happen might be another good starting point. With
the addition of a flush, it's easier to implement a SyncProducer by doing
something like flush() -> n x send() -> flush(). This doesn't guarantee
that a particular batch isn't broken into two, but with sane batch sizes
and sane record sizes, we can assume the guarantee.

On Mon, Feb 2, 2015 at 1:48 PM, Gwen Shapira gshap...@cloudera.com wrote:

 I've been thinking about that too, since both Flume and Sqoop rely on
 send(List) API of the old API.

 I'd like to see this API come back, but I'm debating how we'd handle
 errors. IIRC, the old API would fail an entire batch on a single
 error, which can lead to duplicates. Having N callbacks lets me retry
 / save / whatever just the messages that had issues.

 If messages had identifiers from the producer side, we could have the
 API call the callback with a list of message-ids and their status. But
 they don't :)

 Any thoughts on how you'd like it to work?

 Gwen


 On Mon, Feb 2, 2015 at 1:38 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:
  This is a great question Otis. Like Gwen said, you can accomplish Sync
 mode
  by setting the batch size to 1. But this does highlight a shortcoming of
  the new producer API.
 
  I really like the design of the new API and it has really great
 properties
  and I'm enjoying working with it. However, once API that I think we're
  lacking is a batch API. Currently, I have to iterate over a batch and
  call .send() on each record, which returns n callbacks instead of 1
  callback for the whole batch. This significantly complicates recovery
 logic
  where we need to commit a batch as opposed 1 record at a time.
 
  Do you guys have any plans to add better semantics around batches?
 
  On Mon, Feb 2, 2015 at 1:34 PM, Gwen Shapira gshap...@cloudera.com
 wrote:
 
  If I understood the code and Jay correctly - if you wait for the
  future it will be a similar delay to that of the old sync producer.
 
  Put another way, if you test it out and see longer delays than the
  sync producer had, we need to find out why and fix it.
 
  Gwen
 
  On Mon, Feb 2, 2015 at 1:27 PM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
   Hi,
  
   Nope, unfortunately it can't do that.  X is a remote app, doesn't
 listen
  to
   anything external, calls Y via HTTPS.  So X has to decide what to do
 with
   its data based on Y's synchronous response.  It has to block until Y
   responds.  And it wouldn't be pretty, I think, because nobody wants to
  run
   apps that talk to remove servers and hang on to connections more than
  they
   have to.  But perhaps that is the only way?  Or maybe the answer to
 I'm
   guessing the delay would be more or less the same as if the Producer
 was
   using SYNC mode? is YES, in which case the connection from X to Y
 would
  be
   open for just as long as with a SYNC producer running in Y?
  
   Thanks,
   Otis
   --
   Monitoring * Alerting * Anomaly Detection * Centralized Log Management
   Solr  Elasticsearch Support * http://sematext.com/
  
  
   On Mon, Feb 2, 2015 at 4:03 PM, Gwen Shapira gshap...@cloudera.com
  wrote:
  
   Can Y have a callback that will handle the notification to X?
   In this case, perhaps Y can be async and X can buffer the data until
   the callback triggers and says all good (or resend if the callback
   indicates an error)
  
   On Mon, Feb 2, 2015 at 12:56 PM, Otis Gospodnetic
   otis.gospodne...@gmail.com wrote:
Hi,
   
Thanks for the info.  Here's the use case.  We have something up
  stream
sending data, say a log shipper called X.  It sends it to some
 remote
component Y.  Y is the Kafka Producer and it puts data into Kafka.
  But Y
needs to send a reply to X and tell it whether it successfully put
 all
   its
data into Kafka.  If it did not, Y wants to tell X to buffer data
  locally
and resend it later.
   
If producer is ONLY async, Y can't easily do that.  Or maybe Y
 would
  just
need to wait for the Future to come back and only then send the
  response
back to X?  If so, I'm guessing the delay would be more or less the
  same
   as
if the Producer was using SYNC mode?
   
Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log
 Management
Solr  Elasticsearch Support * http://sematext.com/
   
   
On Mon, Feb 2, 2015 at 3:13 PM, Jay Kreps jay.kr...@gmail.com
  wrote:
   
Yeah as Gwen says

Re: Kafka ETL Camus Question

2015-02-02 Thread Pradeep Gollakota
Hi Bhavesh,

At Lithium, we don't run Camus in our pipelines yet, though we plan to. But
I just wanted to comment regarding speculative execution. We have it
disabled at the cluster level and typically don't need it for most of our
jobs. Especially with something like Camus, I don't see any need to run
parallel copies of the same task.
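
For reference, turning it off per job rather than cluster-wide is a one-line change in the driver configuration; a minimal sketch using the property named in the question (the newer mapreduce.map.speculative key is the same switch under its MR2 name):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Avoid duplicate pulls of the same partition by speculative attempts.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

        Job job = Job.getInstance(conf, "camus-style-ingest"); // illustrative job name
        // ... input format, mapper, and output configuration omitted ...
    }
}
```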

On Mon, Feb 2, 2015 at 10:36 PM, Bhavesh Mistry mistry.p.bhav...@gmail.com
wrote:

 Hi Jun,

 Thanks for info.  I did not get answer  to my question there so I thought I
 try my luck here :)

 Thanks,

 Bhavesh

 On Mon, Feb 2, 2015 at 9:46 PM, Jun Rao j...@confluent.io wrote:

  You can probably ask the Camus mailing list.
 
  Thanks,
 
  Jun
 
  On Thu, Jan 29, 2015 at 1:59 PM, Bhavesh Mistry 
  mistry.p.bhav...@gmail.com
  wrote:
 
   Hi Kafka Team or Linked-In  Team,
  
   I would like to know if you guys run Camus ETL job with speculative
   execution true or false.  Does it make sense to set this to false ?
  Having
   true, it creates additional load on brokers for each map task (create a
  map
   task to pull same partition twice).  Is there any advantage to this
  having
   it on vs off ?
  
   mapred.map.tasks.speculative.execution
  
   Thanks,
  
   Bhavesh
  
 



Re: < and > Comparison Operators not working

2015-02-02 Thread Pradeep Gollakota
Explicit casting will work, though you shouldn't need to use it. You should
specify an input schema using the AS keyword. This will ensure that
PigStorage will load your data using the appropriate types.
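
For example, something like this (assuming the file has two comma-separated
columns, an int and a chararray):

A = LOAD 'data' USING PigStorage(',') AS (f1:int, f2:chararray);
C = FILTER A BY f1 > 20;
DUMP C;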

On Mon, Feb 2, 2015 at 7:22 AM, Arvind S arvind18...@gmail.com wrote:

 Use explicit casting during comparison

 Cheers !!!
 Arvind
 On 02-Feb-2015 8:39 pm, Amit am...@yahoo.com.invalid wrote:

  Thanks for the response. The Pig script as such does not fail, it runs
  successfully (trying in local mode), however when the run is finished it
  does not dump any tuples. Has it something to do with the CSV where the f1
  is stored as a string? The CSV data would look like this -

  10,abc
  20,xyz
  30,lmn
  ... etc

  Thanks,
  Amit
 
   On Monday, February 2, 2015 3:37 AM, Pradeep Gollakota 
  pradeep...@gmail.com wrote:
 
 
    Just to clarify, do you have a semicolon after f1 > 20?

   A = LOAD 'data' USING PigStorage(',');
   B = FOREACH A GENERATE f1;
   C = FILTER B BY f1 > 20;
   DUMP C;

   This should be correct.
 
  On Sun, Feb 1, 2015 at 4:50 PM, Amit am...@yahoo.com.invalid wrote:
 
    Hello, I am trying to run an ad-hoc pig script on IBM Bluemix platform
    that has an arithmetic comparison. Suppose the data is f1 - 10, 20, 30, 40...
    Let us say I would like to select the records where f1 > 20. It is pretty
    easy operation, however I am not sure why I cannot see expected results in
    there. The data is initially loaded from a CSV file. Here is my pig script -

    A = <Load from CSV file>
    B = FOREACH A generate f1;
    C = FILTER B by f1 > 20
    DUMP C;

    Appreciate if someone points out what I am doing wrong here.
    I also tried to run this in local mode just to make sure I am doing this
    right.
    Regards, Amit
 
 



Re: < and > Comparison Operators not working

2015-02-02 Thread Pradeep Gollakota
Just to clarify, do you have a semicolon after f1 > 20?

A = LOAD 'data' USING PigStorage(',');
B = FOREACH A GENERATE f1;
C = FILTER B BY f1 > 20;
DUMP C;

This should be correct.

On Sun, Feb 1, 2015 at 4:50 PM, Amit am...@yahoo.com.invalid wrote:

 Hello, I am trying to run an ad-hoc pig script on IBM Bluemix platform that
 has an arithmetic comparison. Suppose the data is f1 - 10, 20, 30, 40...
 Let us say I would like to select the records where f1 > 20. It is pretty
 easy operation, however I am not sure why I cannot see expected results in
 there. The data is initially loaded from a CSV file. Here is my pig script -

 A = <Load from CSV file>
 B = FOREACH A generate f1;
 C = FILTER B by f1 > 20
 DUMP C;

 Appreciate if someone points out what I am doing wrong here.
 I also tried to run this in local mode just to make sure I am doing this
 right.
 Regards, Amit


Pattern for pig/hive setup files

2015-01-27 Thread Pradeep Gollakota
Hi All,

I'm trying to establish a good pattern and practice with Oozie for sharing
a setup file for pig/hive. For example, I have several scripts that use a
set of UDFs that are built in-house. In order to use the UDF, I need to add
the jar file, and then register the UDF. Rather than repeating this process
multiple times, I can simply put this in the .hiverc/.pigbootup file of the
user which will be running these scripts. However, this does not translate
well into Oozieland. Another option is to put the common code into a
setup file and add a "source setup" to the beginning of every script. This
is kind of ugly in that each file will have the same "source setup"
command. There's another problem of the pathing not necessarily being
translatable across scripts/workflows.
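
(For concreteness, the "source setup" approach looks roughly like this on the
Hive side; the jar path and UDF class below are made up:)

  -- setup.hql
  ADD JAR /opt/lib/our-udfs.jar;
  CREATE TEMPORARY FUNCTION my_udf AS 'com.example.udf.MyUdf';

  -- at the top of each script
  source setup.hql;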

Has anyone come up with a good pattern for sharing hive/pig code across
scripts in Oozieland?

Thanks,
Pradeep


Re: solr indexing using pig script

2015-01-16 Thread Pradeep Gollakota
It looks like your only option then is to use two separate scripts. It's
not ideal because you have twice the I/O, but it should work.

P.S. make sure to hit reply-all so the list is kept in the loop.
On Jan 15, 2015 11:41 PM, Vishnu Viswanath vishnu.viswanat...@gmail.com
wrote:

 Thanks Pradeep for the suggestion.

 I am using zookeeper to store into SOLR. So my location is the zookeeper
 server. I followed this link for doing the same:
 https://docs.lucidworks.com/plugins/servlet/mobile#content/view/24380610

 Is there a better way of doing it if I am using zookeeper?

 Regards,
 Vishnu Viswanath


  On 16-Jan-2015, at 12:34, Pradeep Gollakota pradeep...@gmail.com
 wrote:
 
  Just out of curiosity, why are you using SET to set the solr collection?
  I'm not sure if you're using an out of the box Load/Store Func, but if I
  were to design it, I would use the location of a Load/Store Func to
  specify which solr collection to write to.
 
  Is it possible for you to redesign this way?
 
  On Thu, Jan 15, 2015 at 9:41 PM, Vishnu Viswanath 
  vishnu.viswanat...@gmail.com wrote:
 
  Thanks
 
  SET sets the SOLR collection name. When the STORE is invoked, the data
  will be ingested into the collection name set before.
 
  So, the problem must be because  the second set is overriding the
  collection name and the STORE is failing.
 
  Is there any way to overcome this? Because most of the processing time
 is
  taken in the load and I don't want to do it twice.
 
  Regards,
  Vishnu Viswanath
 
  On 16-Jan-2015, at 09:29, Cheolsoo Park piaozhe...@gmail.com wrote:
 
  What does SET do for Solr? Pig pre-processes all the set commands in
  the
  entire script before executing any query, and values are overwritten if
  the
  same key is set more than once. In your example, you have two set
  commands.
  If you're thinking that different values will be applied in each
 section,
  that's not the case. e) will overwrite a).
 
 
  On Thu, Jan 15, 2015 at 7:46 PM, Vishnu Viswanath 
  vishnu.viswanat...@gmail.com wrote:
 
  Hi All,
 
  I am in indexing data into solr using pig script.
  I have two such scripts, and I tried combining these two scripts into
 a
  single one.
 
  i.e., i have script 1 that does
  
  a)SET solr collection info for collection 1
  b)LOAD data
  c)FILTER data for SOLR collection number 1
  d)STORE data to solr
 
 
  and script 2 that does
  ---
  a)SET solr collection info for collection 2
  b)LOAD data
  c)FILTER data for SOLR collection number 2
  d)STORE data to solr
 
 
  combined script looks something like
  --
  a)SET solr collection info for collection 1
  b)LOAD data
  c)FILTER data from (b) for SOLR collection number 1
  d)STORE data to solr
  e)SET solr collection info for collection 2
  f)FILTER data from (b) for SOLR collection number 2
  g)STORE data to solr
 
  But the store function fails when I run the combined script where as
 it
  runs fine if I run scripts 1 and 2 separately.
 
  Any idea?
 
  Regards,
  Vishnu
 



Re: Is there a way to run one test of special unit test?

2015-01-16 Thread Pradeep Gollakota
If you're using maven AND using surefire plugin 2.7.3+ AND using Junit 4,
then you can do this by specifying -Dtest=TestClass#methodName

ref:
http://maven.apache.org/surefire/maven-surefire-plugin/examples/single-test.html
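
For example, with a hypothetical test class TestFoo and test method testBar:

  mvn test -Dtest=TestFoo#testBar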

On Thu, Jan 15, 2015 at 8:02 PM, Cheolsoo Park piaozhe...@gmail.com wrote:

 I don't think you can disable test cases on the fly in JUnit. You will need
 to add @Ignore annotation and recompile the test file. Correct me if I am
 wrong.

 On Thu, Jan 15, 2015 at 6:55 PM, lulynn_2008 lulynn_2...@163.com wrote:

  Hi All,
 
  There are multiple tests in one Test* file. Is there a way to just run
  only one pointed test?
 
  Thanks
 



Re: solr indexing using pig script

2015-01-15 Thread Pradeep Gollakota
Just out of curiosity, why are you using SET to set the solr collection?
I'm not sure if you're using an out of the box Load/Store Func, but if I
were to design it, I would use the location of a Load/Store Func to
specify which solr collection to write to.

Is it possible for you to redesign this way?
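
For example, with that kind of StoreFunc the combined script would look
roughly like this (the scheme, class name, and filter conditions are made up;
the point is just that the collection comes from the STORE location instead
of a SET):

raw = LOAD '/data/input' USING PigStorage();
coll1 = FILTER raw BY $0 == 'collection1';
coll2 = FILTER raw BY $0 == 'collection2';
STORE coll1 INTO 'solr://zk-host:2181/collection1' USING com.example.SolrStorer();
STORE coll2 INTO 'solr://zk-host:2181/collection2' USING com.example.SolrStorer();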

On Thu, Jan 15, 2015 at 9:41 PM, Vishnu Viswanath 
vishnu.viswanat...@gmail.com wrote:

 Thanks

 SET sets the SOLR collection name. When the STORE is invoked, the data
 will be ingested into the collection name set before.

 So, the problem must be because  the second set is overriding the
 collection name and the STORE is failing.

 Is there any way to overcome this? Because most of the processing time is
 taken in the load and I don't want to do it twice.

 Regards,
 Vishnu Viswanath

  On 16-Jan-2015, at 09:29, Cheolsoo Park piaozhe...@gmail.com wrote:
 
  What does SET do for Solr? Pig pre-processes all the set commands in
 the
  entire script before executing any query, and values are overwritten if
 the
  same key is set more than once. In your example, you have two set
 commands.
  If you're thinking that different values will be applied in each section,
  that's not the case. e) will overwrite a).
 
 
  On Thu, Jan 15, 2015 at 7:46 PM, Vishnu Viswanath 
  vishnu.viswanat...@gmail.com wrote:
 
  Hi All,
 
  I am in indexing data into solr using pig script.
  I have two such scripts, and I tried combining these two scripts into a
  single one.
 
  i.e., i have script 1 that does
  
  a)SET solr collection info for collection 1
  b)LOAD data
  c)FILTER data for SOLR collection number 1
  d)STORE data to solr
 
 
  and script 2 that does
  ---
  a)SET solr collection info for collection 2
  b)LOAD data
  c)FILTER data for SOLR collection number 2
  d)STORE data to solr
 
 
  combined script looks something like
  --
  a)SET solr collection info for collection 1
  b)LOAD data
  c)FILTER data from (b) for SOLR collection number 1
  d)STORE data to solr
  e)SET solr collection info for collection 2
  f)FILTER data from (b) for SOLR collection number 2
  g)STORE data to solr
 
  But the store function fails when I run the combined script where as it
  runs fine if I run scripts 1 and 2 separately.
 
  Any idea?
 
  Regards,
  Vishnu
 



Re: Is there a way to run one test of special unit test?

2015-01-15 Thread Pradeep Gollakota
If you're using maven AND using surefire plugin 2.7.3+ AND using Junit 4,
then you can do this by specifying -Dtest=TestClass#methodName

ref:
http://maven.apache.org/surefire/maven-surefire-plugin/examples/single-test.html

On Thu, Jan 15, 2015 at 8:02 PM, Cheolsoo Park piaozhe...@gmail.com wrote:

 I don't think you can disable test cases on the fly in JUnit. You will need
 to add @Ignore annotation and recompile the test file. Correct me if I am
 wrong.

 On Thu, Jan 15, 2015 at 6:55 PM, lulynn_2008 lulynn_2...@163.com wrote:

  Hi All,
 
  There are multiple tests in one Test* file. Is there a way to just run
  only one pointed test?
 
  Thanks
 



Re: [akka-user] BalancingPool with custom mailbox

2015-01-05 Thread Pradeep Gollakota
Hi Patrik,

Thanks for the response. Is there a work around for this I can employ? Is 
it possible to use a custom mailbox with the old balancing dispatcher 
(from pre 2.3)?
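
(For reference, "my-mailbox" is a priority mailbox along these lines -- just
a sketch, with a placeholder class name and message matching:)

// application.conf:
//   my-mailbox {
//     mailbox-type = "com.example.MyPrioMailbox"
//   }
import akka.actor.ActorSystem
import akka.dispatch.{PriorityGenerator, UnboundedPriorityMailbox}
import com.typesafe.config.Config

class MyPrioMailbox(settings: ActorSystem.Settings, config: Config)
  extends UnboundedPriorityMailbox(
    PriorityGenerator {
      case "high" => 0  // lower value = higher priority
      case _      => 1
    })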

On Friday, January 2, 2015 7:05:02 AM UTC-8, Patrik Nordwall wrote:

 Hi Pradeep,

 Custom mailbox with balancing pool is not possible with Akka 2.3.x.
 It might be supported later, see https://github.com/akka/akka/issues/13961 
 and https://github.com/akka/akka/issues/13964

 Regards,
 Patrik

 On Tue, Dec 30, 2014 at 8:58 PM, Pradeep Gollakota prade...@gmail.com wrote:

 Hi All,

 I’m trying to create an ActorSystem where a set of actors have a shared 
 mailbox that’s prioritized. I’ve tested my mailbox without using the 
 BalancingPool router, and the messages are correctly prioritized. However, 
 when I try to create the actors using BalancingPool, the messages are no 
 longer prioritized. How do I create a BalancingPool router with a custom 
 mailbox?

 With the following code, the messages are not prioritized:

 val system = ActorSystem("MySystem")
 val actor = system.actorOf(
   BalancingPool(1).props(Props[MyActor]).withMailbox("my-mailbox"), "myactor")

 With the following code, the messages are prioritized correctly.

 val system = ActorSystem("MySystem")
 val actor = system.actorOf(Props[MyActor].withMailbox("my-mailbox"), "myactor")

 Thanks in advance,
 Pradeep

 -- 
  Read the docs: http://akka.io/docs/
  Check the FAQ: 
 http://doc.akka.io/docs/akka/current/additional/faq.html
  Search the archives: https://groups.google.com/group/akka-user
 --- 
 You received this message because you are subscribed to the Google Groups 
 Akka User List group.
 To unsubscribe from this group and stop receiving emails from it, send an 
 email to akka-user+...@googlegroups.com.
 To post to this group, send email to akka...@googlegroups.com.
 Visit this group at http://groups.google.com/group/akka-user.
 For more options, visit https://groups.google.com/d/optout.




 -- 

 Patrik Nordwall
 Typesafe http://typesafe.com/ -  Reactive apps on the JVM
 Twitter: @patriknw

 

-- 
  Read the docs: http://akka.io/docs/
  Check the FAQ: 
 http://doc.akka.io/docs/akka/current/additional/faq.html
  Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups Akka 
User List group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to akka-user+unsubscr...@googlegroups.com.
To post to this group, send email to akka-user@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.


[akka-user] BalancingPool with custom mailbox

2014-12-30 Thread Pradeep Gollakota


Hi All,

I’m trying to create an ActorSystem where a set of actors have a shared 
mailbox that’s prioritized. I’ve tested my mailbox without using the 
BalancingPool router, and the messages are correctly prioritized. However, 
when I try to create the actors using BalancingPool, the messages are no 
longer prioritized. How do I create a BalancingPool router with a custom 
mailbox?

With the following code, the messages are not prioritized:

val system = ActorSystem("MySystem")
val actor = system.actorOf(
  BalancingPool(1).props(Props[MyActor]).withMailbox("my-mailbox"), "myactor")

With the following code, the messages are prioritized correctly.

val system = ActorSystem("MySystem")
val actor = system.actorOf(Props[MyActor].withMailbox("my-mailbox"), "myactor")

Thanks in advance,
Pradeep

-- 
  Read the docs: http://akka.io/docs/
  Check the FAQ: 
 http://doc.akka.io/docs/akka/current/additional/faq.html
  Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups Akka 
User List group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to akka-user+unsubscr...@googlegroups.com.
To post to this group, send email to akka-user@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.


Re: Max. storage for Kafka and impact

2014-12-19 Thread Pradeep Gollakota
@Joe, Achanta is using Indian English numerals which is why it's a little
confusing. http://en.wikipedia.org/wiki/Indian_English#Numbering_system
1,00,000 [1 lakh] (Indian English) == 100,000 [1 hundred thousand] (The
rest of the world :P)

On Fri Dec 19 2014 at 9:40:29 AM Achanta Vamsi Subhash 
achanta.va...@flipkart.com wrote:

 Joe,

 - Correction, it's 1,00,000 partitions
 - We can have at max only 1 consumer/partition. Not 50 per 1 partition.
 Yes, we have a hashing mechanism to support future partition increase as
 well. We override the Default Partitioner.
 - We use both Simple and HighLevel consumers depending on the consumption
 use-case.
 - I clearly mentioned that 200 TB/week and not a day.
 - We have separate producers and consumers, each operating as different
 processes in different machines.

 I was explaining why we may end up with so many partitions. I think the
 question about 200 TB/day got deviated.

 Any suggestions reg. the performance impact of the 200TB/week?

 On Fri, Dec 19, 2014 at 10:53 PM, Joe Stein joe.st...@stealth.ly wrote:
 
  Wait, how do you get 2,000 topics each with 50 partitions == 1,000,000
  partitions? I think you can take what I said below and change my 250 to
 25
  as I went with your result (1,000,000) and not your arguments (2,000 x
 50).
 
  And you should think on the processing as a separate step from fetch and
  commit your offset in batch post processing. Then you only need more
  partitions to fetch batches to process in parallel.
 
  Regards, Joestein
 
  On Fri, Dec 19, 2014 at 12:01 PM, Joe Stein joe.st...@stealth.ly
 wrote:
  
   see some comments inline
  
   On Fri, Dec 19, 2014 at 11:30 AM, Achanta Vamsi Subhash 
   achanta.va...@flipkart.com wrote:
  
   We require:
   - many topics
   - ordering of messages for every topic
  
  
   Ordering is only on a per partition basis so you might have to pick a
   partition key that makes sense for what you are doing.
  
  
   - Consumers hit different Http EndPoints which may be slow (in a push
   model). In case of a Pull model, consumers may pull at the rate at
 which
   they can process.
   - We need parallelism to hit with as many consumers. Hence, we
 currently
   have around 50 consumers/topic = 50 partitions.
  
  
   I think you might be mixing up the fetch with the processing. You can
  have
   1 partition and still have 50 message being processed in parallel (so a
   batch of messages).
  
   What language are you working in? How are you doing this processing
   exactly?
  
  
  
   Currently we have:
   2000 topics x 50 = 1,00,000 partitions.
  
  
   If this is really the case then you are going to need at least 250
  brokers
   (~ 4,000 partitions per broker).
  
   If you do that then you are in the 200TB per day world which doesn't
  sound
   to be the case.
  
   I really think you need to strategize more on your processing model
 some
   more.
  
  
  
   The incoming rate of ingestion at max is 100 MB/sec. We are planning
  for a
   big cluster with many brokers.
  
  
   It is possible to handle this on just 3 brokers depending on message
  size,
   ability to batch, durability are also factors you really need to be
   thinking about.
  
  
  
   We have exactly the same use cases as mentioned in this video (usage
 at
   LinkedIn):
    https://www.youtube.com/watch?v=19DvtEC0EbQ
  
   ​To handle the zookeeper scenario, as mentioned in the above video, we
  are
   planning to use SSDs​ and would upgrade to the new consumer (0.9+)
 once
   its
   available as per the below video.
   https://www.youtube.com/watch?v=7TZiN521FQA
  
   On Fri, Dec 19, 2014 at 9:06 PM, Jayesh Thakrar
   j_thak...@yahoo.com.invalid
wrote:
  
Technically/conceptually it is possible to have 200,000 topics, but
 do
   you
really need it like that?What do you intend to do with those
 messages
  -
i.e. how do you forsee them being processed downstream? And are
 those
topics really there to segregate different kinds of processing or
   different
ids?E.g. if you were LinkedIn, Facebook or Google, would you have
 have
   one
topic per user or one topic per kind of event (e.g. login, pageview,
adview, etc.)Remember there is significant book-keeping done within
Zookeeper - and these many topics will make that book-keeping
   significant.
As for storage, I don't think it should be an issue with sufficient
spindles, servers and higher than default memory configuration.
Jayesh
  From: Achanta Vamsi Subhash achanta.va...@flipkart.com
 To: users@kafka.apache.org users@kafka.apache.org
 Sent: Friday, December 19, 2014 9:00 AM
 Subject: Re: Max. storage for Kafka and impact
   
Yes. We need those many max partitions as we have a central
 messaging
service and thousands of topics.
   
On Friday, December 19, 2014, nitin sharma 
  kumarsharma.ni...@gmail.com
   
wrote:
   
 hi,

 Few things you have to plan for:
 a. Ensure that from 

Re: Efficient use of buffered writes in a post-HTablePool world?

2014-12-19 Thread Pradeep Gollakota
Hi Aaron,

Just out of curiosity, have you considered using asynchbase?
https://github.com/OpenTSDB/asynchbase
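
It keeps a client-side buffer and batches the RPCs for you, so you get the
fewer-larger-writes behaviour without managing table instances yourself. A
minimal sketch (table/family/qualifier names are made up):

import org.hbase.async.HBaseClient;
import org.hbase.async.PutRequest;

public class AsyncHBaseSketch {
  public static void main(String[] args) throws Exception {
    HBaseClient client = new HBaseClient("zk-host:2181");
    client.setFlushInterval((short) 1000);  // flush buffered RPCs roughly every second
    PutRequest put = new PutRequest("my_table", "row-1", "cf", "qual", "value");
    client.put(put);           // buffered, not sent immediately
    client.flush().join();     // force out anything still buffered
    client.shutdown().join();  // also flushes, then releases resources
  }
}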


On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk ndimi...@apache.org wrote:

 Hi Aaron,

 Your analysis is spot on and I do not believe this is by design. I see the
 write buffer is owned by the table, while I would have expected there to be
 a buffer per table all managed by the connection. I suggest you raise a
 blocker ticket vs the 1.0.0 release that's just around the corner to give
 this the attention it needs. Let me know if you're not into JIRA, I can
 raise one on your behalf.

 cc Lars, Enis.

 Nice work Aaron.
 -n

 On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu abe...@siftscience.com
 wrote:
 
  Hi All,
 
  TLDR; in the absence of HTablePool, if HTable instances are short-lived,
  how should clients use buffered writes?
 
  I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
  (CDH5.2). One issue I’m confused by is how to effectively use buffered
  writes now that HTablePool has been deprecated[1].
 
  In our 0.94 code, a pathway could get a table from the pool, configure it
  with table.setAutoFlush(false); and write Puts to it. Those writes would
  then go to the table instance’s writeBuffer, and those writes would only
 be
  flushed when the buffer was full, or when we were ready to close out the
  pool. We were intentionally choosing to have fewer, larger writes from
 the
  client to the cluster, and we knew we were giving up a degree of safety
 in
  exchange (i.e. if the client dies after it’s accepted a write but before
   the flush for that write occurs, the data is lost). This seems to be
   generally considered a reasonable choice (cf the HBase Book [2] SS
 14.8.4)
 
  However in the 0.98 world, without HTablePool, the endorsed pattern [3]
  seems to be to create a new HTable via table =
  stashedHConnection.getTable(tableName, myExecutorService). However, even
 if
  we do table.setAutoFlush(false), because that table instance is
  short-lived, its buffer never gets full. We’ll create a table instance,
  write a put to it, try to close the table, and the close call will
 trigger
  a (synchronous) flush. Thus, not having HTablePool seems like it would
  cause us to have many more small writes from the client to the cluster,
 and
  basically wipe out the advantage of turning off autoflush.
 
  More concretely :
 
  // Given these two helpers ...
 
  private HTableInterface getAutoFlushTable(String tableName) throws
  IOException {
// (autoflush is true by default)
return storedConnection.getTable(tableName, executorService);
  }
 
  private HTableInterface getBufferedTable(String tableName) throws
  IOException {
HTableInterface table = getAutoFlushTable(tableName);
table.setAutoFlush(false);
return table;
  }
 
  // it's my contention that these two methods would behave almost
  identically,
  // except the first will hit a synchronous flush during the put call,
  and the second will
  // flush during the (hidden) close call on table.
 
  private void writeAutoFlushed(Put somePut) throws IOException {
try (HTableInterface table = getAutoFlushTable(tableName)) {
  table.put(somePut); // will do synchronous flush
}
  }
 
  private void writeBuffered(Put somePut) throws IOException {
try (HTableInterface table = getBufferedTable(tableName)) {
  table.put(somePut);
} // auto-close will trigger synchronous flush
  }
 
  It seems like the only way to avoid this is to have long-lived HTable
  instances, which get reused for multiple writes. However, since the
 actual
  writes are driven from highly concurrent code, and since HTable is not
  threadsafe, this would involve having a number of HTable instances, and a
  control mechanism for leasing them out to individual threads safely.
 Except
  at this point it seems like we will have recreated HTablePool, which
  suggests that we’re doing something deeply wrong.
 
  What am I missing here? Since the HTableInterface.setAutoFlush method
 still
  exists, it must be anticipated that users will still want to buffer
 writes.
  What’s the recommended way to actually buffer a meaningful number of
  writes, from a multithreaded context, that doesn’t just amount to
 creating
  a table pool?
 
  Thanks in advance,
  Aaron
 
  [1] https://issues.apache.org/jira/browse/HBASE-6580
  [2] http://hbase.apache.org/book/perf.writing.html
  [3]
 
 
  https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
 



Re: Reduce load to hbase

2014-12-07 Thread Pradeep Gollakota
This doesn't answer your question per se, but this is how we dealt with
load on HBase at Lithium. We power klout.com with HBase. On a nightly
basis, we load user profile data and Klout scores for approx. 600 million
users into HBase. We also do maintenance on HBase such as major compactions
on a regular basis. When either a load or maintenance is being performed,
site performance on klout.com used to degrade pretty severely. In order to
mitigate this, we stood up 2 HBase clusters and now power klout.com off
both. We run these in a custom built active/passive mode. The application
layer uses a zookeeper flag to connect to the active cluster and serves
from there. We load data or do maintenance on the passive, then flip the
clusters and repeat the load/maintenance on the previously active cluster.
This mechanism of active/passive systems has been working pretty well for
us. It does however require a significant cost in terms of maintaining 2
clusters.
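
(The zookeeper flag itself is nothing fancy; roughly something like this,
with the znode path and cluster names being placeholders:)

import org.apache.zookeeper.ZooKeeper;

public class ActiveClusterResolver {
  // Reads which HBase cluster is currently "active"; the application layer
  // picks its HBase connection config based on this value.
  public static String activeCluster(ZooKeeper zk) throws Exception {
    byte[] data = zk.getData("/hbase-active-cluster", false, null);
    return new String(data, "UTF-8");  // e.g. "cluster-a" or "cluster-b"
  }
}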

On Sun Dec 07 2014 at 9:35:00 AM gomes sankarm...@gmail.com wrote:

 Currently some system cleaning tasks do read all the rows, and then perform
 some operations on that. It impacts other users who are being served at the
 same time. System cleaning tasks are of lower priority, and I can delay the
 requests, but I am just wondering if there is anyway I can hook into hbase
 system, and if I can continuously measure the load of the system, and based
 on that I can limit the lower priority tasks? How can I do that, if there
 are any pointers, or helpful suggestions, please provide them.



 --
 View this message in context: http://apache-hbase.679495.n3.
 nabble.com/Reduce-load-to-hbase-tp4066727.html
 Sent from the HBase User mailing list archive at Nabble.com.



Re: What companies are using HBase to serve a customer-facing product?

2014-12-06 Thread Pradeep Gollakota
Lithium (Klout) powers www.klout.com with HBase. The operations team is 2
full time engineers + the manager (who also does hands on operations work
with the team). This operations team is responsible for the entirety of our
Hadoop stack including the HBase clusters. We have one 165 node Hive
cluster for Data Science and 5 HBase clusters (of varying sizes), 2 of
which are used to power klout.com.

We have strong SLA requirements for klout.com as it is a user facing
product. I don't remember the sizing of our HBase clusters off hand but
they are substantial enough to load user profile data and Klout scores for
approximately 600 million users on a daily basis. I believe the data set is
in the order of several terabytes.

On Sat Dec 06 2014 at 8:49:37 PM lars hofhansl la...@apache.org wrote:

 For expected latency, read this:
 http://hadoop-hbase.blogspot.com/2014/08/hbase-client-response-times.html
 For cluster/machine sizing this might be helpful:
 http://hadoop-hbase.blogspot.com/2013/01/hbase-region-server-memory-sizing.html
 Disclaimer: I wrote these two posts.
 -- Lars

   From: jeremy p athomewithagroove...@gmail.com
  To: user@hbase.apache.org
  Sent: Friday, December 5, 2014 1:37 PM
  Subject: What companies are using HBase to serve a customer-facing
 product?

 Hey all,

 So, I'm currently evaluating HBase as a solution for querying a very large
 data set (think 60+ TB). We'd like to use it to directly power a
 customer-facing product. My question is threefold :

 1) What companies use HBase to serve a customer-facing product? I'm not
 interested in evaluations, experiments, or POC.  I'm also not interested in
 offline BI or analytics.  I'm specifically interested in cases where HBase
 serves as the data store for a customer-facing product.

 2) Of the companies that use HBase to serve a customer-facing product,
 which ones use it to query data sets of 60TB or more?

 3) Of companies use HBase to query 60+ TB data sets and serve a
 customer-facing product, how many employees are required to support their
 HBase installation?  In other words, if I were to start a team tomorrow,
 and their purpose was to maintain a 60+ TB HBase installation for a
 customer-facing product, how many people should I hire?

 4) Of companies use HBase to query 60+ TB data sets and serve a
 customer-facing product, what kind of measures do they take for disaster
 recovery?

 If you can, please point me to articles, videos, and other materials.
 Obviously, the larger the company, the better case it will make for HBase.

 Thank you!





Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
Java strings are immutable. So pdfText.concat() returns a new string and
the original string is left unmolested. So at the end, all you're doing is
returning an empty string. Instead, you can do pdfText =
pdfText.concat(...). But the better way to write it is to use a
StringBuilder.

StringBuilder pdfText = ...;
pdfText.append(...);
pdfText.append(...);
...
return pdfText.toString();

On Fri Dec 05 2014 at 9:12:37 AM Ryan freelanceflashga...@gmail.com wrote:

 Hi,

 I'm working on an open source project attempting to convert raw content
 from a pdf (stored as a databytearray) into plain text using a Pig UDF and
 Apache Tika. I could use your help. For some reason, the UDF I'm using
 isn't working. The script succeeds but no output is written. *This is the
 Pig script I'm following:*

 register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
 DEFINE ExtractTextFromPDFs
  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
 DEFINE ArcLoader org.warcbase.pig.ArcLoader();

 raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray, date:
 chararray, mime: chararray, content: bytearray); --load the data

 a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages from the
 arc file
 b = LIMIT a 2; --limit to 2 pages to speed up testing time
 c = foreach b generate url, ExtractTextFromPDFs(content);
 store c into 'output/pdf_test';


 *This is the UDF I wrote:*

  public class ExtractTextFromPDFs extends EvalFunc<String> {

   @Override
   public String exec(Tuple input) throws IOException {
    String pdfText = "";

   if (input == null || input.size() == 0 || input.get(0) == null) {
    return "N/A";
   }

   DataByteArray dba = (DataByteArray)input.get(0);
   pdfText.concat(String.valueOf(dba.size())); //my attempt at
 debugging. Nothing written

   InputStream is = new ByteArrayInputStream(dba.get());

   ContentHandler contenthandler = new BodyContentHandler();
   Metadata metadata = new Metadata();
   DefaultDetector detector = new DefaultDetector();
   AutoDetectParser pdfparser = new AutoDetectParser(detector);

   try {
 pdfparser.parse(is, contenthandler, metadata, new ParseContext());
   } catch (SAXException | TikaException e) {
 // TODO Auto-generated catch block
 e.printStackTrace();
   }
    pdfText.concat(" : "); //another attempt at debugging. Still nothing written
    pdfText.concat(contenthandler.toString());

   //close the input stream
   if(is != null){
 is.close();
   }
   return pdfText;
   }

 }

 Thank you for your assistance,
 Ryan



Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
I forgot to mention earlier that you should probably move the PdfParser
initialization code out of the evaluate method. This will probably cause a
significant overhead both in terms of gc and runtime performance. You'll
want to initialize your parser once and evaluate all your docs against it.
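
Roughly like this (a sketch of just the structural change):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.parser.AutoDetectParser;

public class ExtractTextFromPDFs extends EvalFunc<String> {
    // created once per UDF instance instead of once per exec() call
    private final AutoDetectParser pdfParser = new AutoDetectParser(new DefaultDetector());

    @Override
    public String exec(Tuple input) throws IOException {
        // ... same parsing logic as before, but reusing pdfParser ...
        return "";
    }
}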

- Pradeep

On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota pradeep...@gmail.com
wrote:

 Java string's are immutable. So pdfText.concat() returns a new string
 and the original string is left unmolested. So at the end, all you're doing
 is returning an empty string. Instead, you can do pdfText =
 pdfText.concat(...). But the better way to write it is to use a
 StringBuilder.

 StringBuilder pdfText = ...;
 pdfText.append(...);
 pdfText.append(...);
 ...
 return pdfText.toString();

 On Fri Dec 05 2014 at 9:12:37 AM Ryan freelanceflashga...@gmail.com
 wrote:

 Hi,

 I'm working on an open source project attempting to convert raw content
 from a pdf (stored as a databytearray) into plain text using a Pig UDF and
 Apache Tika. I could use your help. For some reason, the UDF I'm using
 isn't working. The script succeeds but no output is written. *This is the
 Pig script I'm following:*

 register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
 DEFINE ExtractTextFromPDFs
  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
 DEFINE ArcLoader org.warcbase.pig.ArcLoader();

 raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
 date:
 chararray, mime: chararray, content: bytearray); --load the data

 a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages from
 the
 arc file
 b = LIMIT a 2; --limit to 2 pages to speed up testing time
 c = foreach b generate url, ExtractTextFromPDFs(content);
 store c into 'output/pdf_test';


 *This is the UDF I wrote:*

  public class ExtractTextFromPDFs extends EvalFunc<String> {

   @Override
   public String exec(Tuple input) throws IOException {
    String pdfText = "";

   if (input == null || input.size() == 0 || input.get(0) == null) {
    return "N/A";
   }

   DataByteArray dba = (DataByteArray)input.get(0);
   pdfText.concat(String.valueOf(dba.size())); //my attempt at
 debugging. Nothing written

   InputStream is = new ByteArrayInputStream(dba.get());

   ContentHandler contenthandler = new BodyContentHandler();
   Metadata metadata = new Metadata();
   DefaultDetector detector = new DefaultDetector();
   AutoDetectParser pdfparser = new AutoDetectParser(detector);

   try {
 pdfparser.parse(is, contenthandler, metadata, new ParseContext());
   } catch (SAXException | TikaException e) {
 // TODO Auto-generated catch block
 e.printStackTrace();
   }
    pdfText.concat(" : "); //another attempt at debugging. Still nothing written
    pdfText.concat(contenthandler.toString());

   //close the input stream
   if(is != null){
 is.close();
   }
   return pdfText;
   }

 }

 Thank you for your assistance,
 Ryan




Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
A static variable is not necessary... a simple instance variable is just
fine.

On Fri Dec 05 2014 at 2:27:53 PM Ryan freelanceflashga...@gmail.com wrote:

 After running it with updated code, it seems like the problem has to do
 with something related to Tika since my output says that my input is the
 correct number of bytes (i.e. it's actually being sent in correctly). Going
 to test further to narrow down the problem.

 Pradeep, would you recommend using a static variable inside the
 ExtractTextFromPDFs function to store the PdfParser once it has been
 initialized once? I'm still learning how to best do things within the
 Pig/MapReduce/Hadoop framework

 Ryan

 On Fri, Dec 5, 2014 at 1:35 PM, Ryan freelanceflashga...@gmail.com
 wrote:

  Thanks Pradeep! I'll give it a try and report back
 
  Ryan
 
  On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota pradeep...@gmail.com
 
  wrote:
 
  I forgot to mention earlier that you should probably move the PdfParser
  initialization code out of the evaluate method. This will probably
 cause a
  significant overhead both in terms of gc and runtime performance. You'll
  want to initialize your parser once and evaluate all your docs against
 it.
 
  - Pradeep
 
  On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota 
 pradeep...@gmail.com
  wrote:
 
   Java string's are immutable. So pdfText.concat() returns a new
 string
   and the original string is left unmolested. So at the end, all you're
  doing
   is returning an empty string. Instead, you can do pdfText =
   pdfText.concat(...). But the better way to write it is to use a
   StringBuilder.
  
   StringBuilder pdfText = ...;
   pdfText.append(...);
   pdfText.append(...);
   ...
   return pdfText.toString();
  
   On Fri Dec 05 2014 at 9:12:37 AM Ryan freelanceflashga...@gmail.com
   wrote:
  
   Hi,
  
   I'm working on an open source project attempting to convert raw
 content
   from a pdf (stored as a databytearray) into plain text using a Pig
 UDF
  and
   Apache Tika. I could use your help. For some reason, the UDF I'm
 using
   isn't working. The script succeeds but no output is written. *This is
  the
   Pig script I'm following:*
  
   register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
   DEFINE ExtractTextFromPDFs
org.warcbase.pig.piggybank.ExtractTextFromPDFs();
   DEFINE ArcLoader org.warcbase.pig.ArcLoader();
  
   raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
   date:
   chararray, mime: chararray, content: bytearray); --load the data
  
   a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages
 from
   the
   arc file
   b = LIMIT a 2; --limit to 2 pages to speed up testing time
   c = foreach b generate url, ExtractTextFromPDFs(content);
   store c into 'output/pdf_test';
  
  
   *This is the UDF I wrote:*
  
   public class ExtractTextFromPDFs extends EvalFuncString {
  
 @Override
 public String exec(Tuple input) throws IOException {
 String pdfText = ;
  
 if (input == null || input.size() == 0 || input.get(0) ==
 null) {
 return N/A;
 }
  
 DataByteArray dba = (DataByteArray)input.get(0);
 pdfText.concat(String.valueOf(dba.size())); //my attempt at
   debugging. Nothing written
  
 InputStream is = new ByteArrayInputStream(dba.get());
  
 ContentHandler contenthandler = new BodyContentHandler();
 Metadata metadata = new Metadata();
 DefaultDetector detector = new DefaultDetector();
 AutoDetectParser pdfparser = new AutoDetectParser(detector);
  
 try {
   pdfparser.parse(is, contenthandler, metadata, new
  ParseContext());
 } catch (SAXException | TikaException e) {
   // TODO Auto-generated catch block
   e.printStackTrace();
 }
 pdfText.concat( : ); //another attempt at debugging. Still
  nothing
   written
 pdfText.concat(contenthandler.toString());
  
 //close the input stream
 if(is != null){
   is.close();
 }
 return pdfText;
 }
  
   }
  
   Thank you for your assistance,
   Ryan
  
  
 
 
 



Re: Using Pig To Scan Hbase

2014-12-05 Thread Pradeep Gollakota
There is a built in storage handler for HBase. Take a look at the docs at
https://pig.apache.org/docs/r0.14.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html

It doesn't support dealing with salted rowkeys (or reverse timestamps) out
of the box, so you may have to munge with the data a little bit after it's
loaded to get what you want.

Hope this helps.
Pradeep

On Fri Dec 05 2014 at 9:55:04 AM Nishanth S nishanth.2...@gmail.com wrote:

 Hey folks,

 I am trying to write a map reduce in pig against my hbase table.I have a
 salting in my rowkey appended with  reverse timestamps ,so I guess the best
 way is to do a scan for all the dates that I require to pull out
 records.Does any one know if pig supports  hbase scan out of the box or  do
 we need  to write a udf for that.

 Thannks,
 Nishanth



Re: Custom FileInputFormat.class

2014-12-01 Thread Pradeep Gollakota
Can you expand on your use case a little bit please? It may be that you're
duplicating functionality.

You can take a look at the CombineFileInputFormat for inspiration. If this
is indeed taking a long time, one cheap to implement thing you can do is to
parallelize the calls to get block locations.
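
Parallelizing the per-file reads could look roughly like this (the thread
count and the readMyHeader() helper are placeholders for whatever per-file
work you need to do):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelFileInfo {
  public static List<byte[]> readHeaders(final FileSystem fs, List<FileStatus> files)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(16);
    List<Future<byte[]>> futures = new ArrayList<Future<byte[]>>();
    for (final FileStatus f : files) {
      futures.add(pool.submit(new Callable<byte[]>() {
        public byte[] call() throws Exception {
          return readMyHeader(fs, f.getPath());  // the ~500ms open+read, now done concurrently
        }
      }));
    }
    pool.shutdown();  // queued tasks still run to completion
    List<byte[]> out = new ArrayList<byte[]>();
    for (Future<byte[]> fut : futures) {
      out.add(fut.get());
    }
    return out;
  }

  private static byte[] readMyHeader(FileSystem fs, Path p) throws Exception {
    return new byte[0];  // placeholder for whatever info you actually read
  }
}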

Another question to ask yourself is whether it is worth it to optimize this
portion. In many use cases, (certainly mine), the bottleneck is the running
job itself. So the launch overhead is comparatively minimal.

Hope this helps.
Pradeep

On Mon Dec 01 2014 at 8:38:30 AM 胡斐 hufe...@gmail.com wrote:

 Hi,

  I want to customize FileInputFormat.class. In order to determine which host
 the specific part of a file belongs to, I need to open the file in HDFS and
 read some information. It will take me nearly 500ms to open a file and get
 the information I need. But now I have thousands of files to deal with, so
 it would be a long time if I deal with all of them as the above.

 Is there any better solution to reduce the time when the number of files
 is large?

 Thanks in advance!
 Fei




Re: [protobuf] Parse a .proto file

2014-11-05 Thread Pradeep Gollakota


Ok… I finally figured out the work around for this. I use a separate .proto 
file that contains my custom options.

package com.lithum.pbnj;

import "google/protobuf/descriptor.proto";

option java_package = "com.lithium.pbnj";

message LiOptions {
optional bool isPii = 1 [default = false];
optional bool isEmail = 2 [default = false];
optional bool isIpAddress = 3 [default = false];
}

extend google.protobuf.FieldOptions {
optional LiOptions li_opts = 50101;
}

Then I compile this .proto into a .java and can use it. When a message uses 
this extension, I can figure out which fields use my options with the 
following code:

Descriptors.FileDescriptor fieldOptionsDesc = 
DescriptorProtos.FieldOptions.getDescriptor().getFile();
Descriptors.FileDescriptor extensionsDesc = 
Extensions.getDescriptor().getFile();
Descriptors.FileDescriptor[] files = new 
Descriptors.FileDescriptor[]{fieldOptionsDesc, extensionsDesc};

DescriptorProtos.FileDescriptorSet set = 
DescriptorProtos.FileDescriptorSet.parseFrom(
PBnJ.class.getResourceAsStream("/messages.desc"));
DescriptorProtos.FileDescriptorProto messages = set.getFile(0);
Descriptors.FileDescriptor fileDesc = 
Descriptors.FileDescriptor.buildFrom(messages, files);
Descriptors.Descriptor md = 
fileDesc.findMessageTypeByName("MessagePublish");

Set<Descriptors.FieldDescriptor> piiFields = Sets.newHashSet();
for (Descriptors.FieldDescriptor fieldDescriptor : md.getFields()) {
DescriptorProtos.FieldOptions options = 
fieldDescriptor.getOptions();
UnknownFieldSet.Field field = 
options.getUnknownFields().asMap().get(Extensions.LI_OPTS_FIELD_NUMBER);
if (field != null) {
Extensions.LiOptions liOptions = 
Extensions.LiOptions.parseFrom(field.getLengthDelimitedList().get(0).toByteArray());
if (liOptions.getIsEmail() || liOptions.getIsIpAddress() || 
liOptions.getIsPii()) {
piiFields.add(fieldDescriptor);
System.out.println(fieldDescriptor.toProto());
}
}
}

On Saturday, November 1, 2014 4:26:35 AM UTC-7, Oliver wrote:

On 1 November 2014 02:24, Pradeep Gollakota prade...@gmail.com wrote:
  Confirmed... When I replaced the md variable with the compiled 
 Descriptor, 
  it worked. I didn't think I was mixing the descriptors, e.g. the 
  MessagePublish message is one that is produced via the compiled API and 
  parsed using the DynamicMessage API. The isPii extension has been 
 refactored 
  into a separate proto that is precompiled into my codebase. I.E. the 
  descriptor for MessagePublish should be loaded dynamically and the 
  descriptor for the FieldOption I'm defining won't be loaded dynamically. 
 As 
  far as I can tell, there shouldn't be any mixing of the descriptor 
 pools, 
  though I may be wrong. 

 This is exactly where the problem is, though - you have: 

 MessagePublish descriptor D1 (from encoded descriptorset) references 
 extension field F1 (from encoded descriptorset) 
 Message descriptor D2 (from pregenerated code) references extension 
 field F2 (from pregenerated code) 

 So if you have a message built from D1 then it thinks it has a field 
 F1; when you ask if it has extension F2 it says no! even though 
 they're really the same thing. 

  Any thoughts on how I can proceed with this project? 

 It seems like a flaw in the API .. In the case I ran into, I could 
 work around it as the processing code only wanted to work with 
 non-option extensions when it had the precompiled code for the 
 extension, so for those well-known message types I'd just look up the 
 descriptor to use from the precompiled set rather than using the 
 in-stream descriptor set (the message format included the descriptors 
 inline). That doesn't really work here though. 

 Can you inspect the options using field descriptors from the encoded 
 descriptorset, rather than using Messages.pii from the pregenerated 
 code? 

 Oliver 

​

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to protobuf+unsubscr...@googlegroups.com.
To post to this group, send email to protobuf@googlegroups.com.
Visit this group at http://groups.google.com/group/protobuf.
For more options, visit https://groups.google.com/d/optout.


Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota
Hi Oliver,

Thanks for the response! I guess my question wasn't quite clear. In my java 
code I have a string which contains the content of a .proto file. Given 
this string, how can I create an instance of a Descriptor class so I can do 
DynamicMessage parsing.

Thanks!
- Pradeep

On Thursday, October 30, 2014 2:41:19 PM UTC-7, Oliver wrote:

 On 30 October 2014 02:53, Pradeep Gollakota prade...@gmail.com wrote:

  I have a use case where I need to parse messages without having the 
  corresponding precompiled classes in Java. So the DynamicMessage seems 
 to be 
  the correct fit, but I'm not sure how I can generate the DescriptorSet 
 from 
  the .proto definition. 

 protoc --descriptor_set_out=FILE ? 

 Oliver 


-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to protobuf+unsubscr...@googlegroups.com.
To post to this group, send email to protobuf@googlegroups.com.
Visit this group at http://groups.google.com/group/protobuf.
For more options, visit https://groups.google.com/d/optout.


Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota


Ok… awesome… I do have the .proto’s ahead of time, so I can have them 
compiled to the .desc files and store those.

Here’s my .proto file:

package com.lithum.pbnj;

import "google/protobuf/descriptor.proto";

option java_package = "com.lithium.pbnj";

extend google.protobuf.FieldOptions {
optional bool isPii = 50101;
}

message MessagePublish {
required string uuid = 1;
required int64 timestamp = 2;
required int64 message_uid = 3;
required string message_content = 4;
required int64 message_author_uid = 5;
optional string email = 6 [(isPii) = true];
}

I compiled this .proto file into a .desc file using the command you gave 
me. I’m now trying to parse a DynamicMessage from the .desc file. Here’s 
the code I have so far.

DescriptorProtos.FileDescriptorSet descriptorSet = 
DescriptorProtos.FileDescriptorSet.parseFrom(PBnJ.class.getResourceAsStream("/messages.desc"));
Descriptors.Descriptor desc = 
descriptorSet.getFile(0).getDescriptorForType();

Messages.MessagePublish event = Messages.MessagePublish.newBuilder()
    .setUuid(UUID.randomUUID().toString())
    .setTimestamp(System.currentTimeMillis())
    .setEmail("he...@example.com")
    .setMessageAuthorUid(1)
    .setMessageContent("hello world!")
    .setMessageUid(1)
    .build();

DynamicMessage dynamicMessage = DynamicMessage.parseFrom(desc, 
event.toByteArray());

The final line in the above code is throwing the following exception:

Exception in thread main com.google.protobuf.InvalidProtocolBufferException: 
Protocol message end-group tag did not match expected tag.
at 
com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
at 
com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:174)
at 
com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:478)
at 
com.google.protobuf.MessageReflection$BuilderAdapter.parseMessage(MessageReflection.java:482)
at 
com.google.protobuf.MessageReflection.mergeFieldFrom(MessageReflection.java:780)
at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:336)
at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:318)
at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:229)
at 
com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:180)
at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:419)
at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:229)
at 
com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:171)
at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:412)
at com.google.protobuf.DynamicMessage.parseFrom(DynamicMessage.java:119)
at com.lithium.pbnj.PBnJ.main(PBnJ.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

On Friday, October 31, 2014 11:39:27 AM UTC-7, Oliver wrote:

Basically, you can't do that in pure Java - the compiler is a C++ 
 binary, there is no Java version. 

 Still, working with the output of --descriptor_set_out is probably the 
 way to go here. If you have the .proto file ahead of time, you can 
 pregenerate the descriptor output at build time and store it instead 
 of the .proto file. If you don't have the .proto file ahead of time 
 (and you can't redesign - this is not a good design) then you could 
 run the compiler at runtime and read the output. Either way, now you 
 have a parsed version of the message format as a protobuf-encoded 
 message that you can read into your Java program and extract the 
 Descriptors you need. 

 If you're looking at a selfdescribing message format, then I'd go with 
 using the parsed descriptors as your format description, not the text 
 .proto file. 

 Oliver 


 On 31 October 2014 17:56, Pradeep Gollakota prade...@gmail.com wrote:
  Hi Oliver, 
  
  Thanks for the response! I guess my question wasn't quite clear. In my 
 java 
  code I have a string which contains the content of a .proto file. Given 
 this 
  string, how can I create an instance of a Descriptor class so I can do 
  DynamicMessage parsing. 
  
  Thanks! 
  - Pradeep 
  
  On Thursday, October 30, 2014 2:41:19 PM UTC-7, Oliver wrote: 
  
  On 30 October 2014 02:53, Pradeep Gollakota prade...@gmail.com 
 wrote: 
  
   I have a use case where I need to parse messages without having the 
   corresponding precompiled classes in Java. So the DynamicMessage

Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota


Ok… so I was finally able to parse a dynamic message and it looks good. It 
looks like it was just a user error on my part… after a little bit of 
digging around, I found the right APIs to call. Now my code looks like:

Descriptors.FileDescriptor fieldOptionsDesc = 
DescriptorProtos.FieldOptions.getDescriptor().getFile();
DescriptorProtos.FileDescriptorSet set = DescriptorProtos.FileDescriptorSet.parseFrom(
    PBnJ.class.getResourceAsStream("/messages.desc"));
Descriptors.Descriptor md = Descriptors.FileDescriptor.buildFrom(set.getFile(0),
    new Descriptors.FileDescriptor[]{fieldOptionsDesc}).findMessageTypeByName("MessagePublish");

Messages.MessagePublish event = Messages.MessagePublish.newBuilder()
    .setUuid(UUID.randomUUID().toString())
    .setTimestamp(System.currentTimeMillis())
    .setEmail("he...@example.com")
    .setMessageAuthorUid(1)
    .setMessageContent("hello world!")
    .setMessageUid(1)
    .build();

DynamicMessage dynamicMessage = DynamicMessage.parseFrom(md, 
event.toByteArray());
// Parse worked!

for (Descriptors.FieldDescriptor fieldDescriptor : md.getFields()) {
    Boolean extension = fieldDescriptor.getOptions().getExtension(Messages.isPii);
    System.out.println(fieldDescriptor.getName() + " isPii = " + extension);
}

The output is:

uuid isPii = false
timestamp isPii = false
message_uid isPii = false
message_content isPii = false
message_author_uid isPii = false
email isPii = false

For some reason, this is incorrectly showing “isPii = false” for the email 
field when it should be “isPii = true” (as it is in the .proto file). Any 
thoughts on this?

Thanks again all!
On Friday, October 31, 2014 2:18:44 PM UTC-7, Ilia Mirkin wrote:

At no point are you specifying that you want to use the 
 MessagePublish descriptor, so you must still be using the API 
 incorrectly... 

  On Fri, Oct 31, 2014 at 5:10 PM, Pradeep Gollakota prade...@gmail.com wrote:
  Ok… awesome… I do have the .proto’s ahead of time, so I can have them 
  compiled to the .desc files and store those. 
  
  Here’s my .proto file: 
  
  package com.lithum.pbnj; 
  
  import google/protobuf/descriptor.proto; 
  
  option java_package = com.lithium.pbnj; 
  
  extend google.protobuf.FieldOptions { 
  optional bool isPii = 50101; 
  } 
  
  message MessagePublish { 
  required string uuid = 1; 
  required int64 timestamp = 2; 
  required int64 message_uid = 3; 
  required string message_content = 4; 
  required int64 message_author_uid = 5; 
  optional string email = 6 [(isPii) = true]; 
  } 
  
  I compiled this .proto file into a .desc file using the command you gave 
 me. 
  I’m now trying to parse a DynamicMessage from the .desc file. Here’s the 
  code I have so far. 
  
  DescriptorProtos.FileDescriptorSet descriptorSet = 
  
 DescriptorProtos.FileDescriptorSet.parseFrom(PBnJ.class.getResourceAsStream(/messages.desc));
  

  Descriptors.Descriptor desc = 
  descriptorSet.getFile(0).getDescriptorForType(); 
  
  Messages.MessagePublish event = 
 Messages.MessagePublish.newBuilder() 
  .setUuid(UUID.randomUUID().toString()) 
  .setTimestamp(System.currentTimeMillis()) 
  .setEmail(he...@example.com) 
  .setMessageAuthorUid(1) 
  .setMessageContent(hello world!) 
  .setMessageUid(1) 
  .build(); 
  
  DynamicMessage dynamicMessage = DynamicMessage.parseFrom(desc, 
  event.toByteArray()); 
  
  The final line in the above code is throwing the following exception: 
  
  Exception in thread main 
  com.google.protobuf.InvalidProtocolBufferException: Protocol message 
  end-group tag did not match expected tag. 
  at 
  
 com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
  

  at 
  
 com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:174)
  

  at 
  
 com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:478) 
  at 
  
 com.google.protobuf.MessageReflection$BuilderAdapter.parseMessage(MessageReflection.java:482)
  

  at 
  
 com.google.protobuf.MessageReflection.mergeFieldFrom(MessageReflection.java:780)
  

  at 
  
 com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:336)
  

  at 
  
 com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:318)
  

  at 
  
 com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:229)
  

  at 
  
 com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:180)
  

  at 
  
 com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:419

Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota
Not really... one of the use cases I'm trying to solve for is an 
anonymization use case. We will have several apps writing protobuf records 
and the data will pass through an anonymization layer. The anonymizer 
inspects the schemas for all incoming data and will transform the pii 
fields. Since I will be defining the custom options that will be used by 
the app devs, I will have precompiled classes available for reference just 
like the code shows.

So what I'm trying to figure out is, using the DynamicMessage API and 
having parsed a Descriptor, how do I find all the fields which have been 
annotated with the (isPii = true) option.

On Friday, October 31, 2014 3:25:51 PM UTC-7, Ilia Mirkin wrote:

 On Fri, Oct 31, 2014 at 6:18 PM, Pradeep Gollakota prade...@gmail.com wrote:
  Boolean extension = 
  fieldDescriptor.getOptions().getExtension(Messages.isPii); 

 Shouldn't this use some sort of API that doesn't use the Messages class at 
 all? 

   -ilia 




Re: [protobuf] Parse a .proto file

2014-10-31 Thread Pradeep Gollakota
Confirmed... When I replaced the md variable with the compiled
Descriptor, it worked. I didn't think I was mixing the descriptors, e.g.
the MessagePublish message is one that is produced via the compiled API and
parsed using the DynamicMessage API. The isPii extension has been
refactored into a separate proto that is precompiled into my codebase. I.E.
the descriptor for MessagePublish should be loaded dynamically and the
descriptor for the FieldOption I'm defining won't be loaded dynamically. As
far as I can tell, there shouldn't be any mixing of the descriptor pools,
though I may be wrong.

Any thoughts on how I can proceed with this project?
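
One possible way forward (a sketch, not something confirmed in this thread): register the precompiled extension before parsing the descriptor set, so the custom option is decoded on the dynamically loaded FieldOptions rather than left in its unknown fields.

```
// Sketch: parse the descriptor set with an ExtensionRegistry that knows about
// the precompiled isPii option (registerAllExtensions is generated by protoc
// in the outer class of the options .proto -- "Messages" here, as in the
// earlier snippet).
import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.ExtensionRegistry;
import java.io.InputStream;

public final class DescriptorSetWithOptions {
  public static DescriptorProtos.FileDescriptorSet parse(InputStream in) throws Exception {
    ExtensionRegistry registry = ExtensionRegistry.newInstance();
    Messages.registerAllExtensions(registry);
    return DescriptorProtos.FileDescriptorSet.parseFrom(in, registry);
  }
}
```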

On Fri Oct 31 2014 at 7:01:55 PM Oliver Jowett oliver.jow...@gmail.com
wrote:

 You may be running into issues where the set of descriptors associated
 with your parsed DynamicMessage (i.e. the ones you parsed at runtime)
 do not match the set of descriptors from your pregenerated code (which
 will be using their own descriptor pool). IIRC they're looked up by
 identity, so even if they have the same structure they won't match if
 loaded separately. It's a bit of a wart in the API - I'm not sure what
 the right way to do this is, if you try to mix pregenerated code 
 dynamically loaded descriptors, all sorts of things break.

 Oliver


 On 1 November 2014 00:48, Pradeep Gollakota pradeep...@gmail.com wrote:
  Not really... one of the use cases I'm trying to solve for is an
  anonymization use case. We will have several app's writing protobuf
 records
  and the data will pass through an anonymization layer. The anonymizer
  inspects the schema's for all incoming data and will transform the pii
  fields. Since I will be defining the custom options that will be used by
 the
  app dev's, I will have precompiled classes available for reference just
 like
  the code shows.
 
  So what I'm trying to figure out is, using the DynamicMessage API and
 having
  parsed a Descriptor, how do I find all the fields which have been
 annotated
  with the (isPii = true) option.
 
  On Friday, October 31, 2014 3:25:51 PM UTC-7, Ilia Mirkin wrote:
 
  On Fri, Oct 31, 2014 at 6:18 PM, Pradeep Gollakota prade...@gmail.com
  wrote:
   Boolean extension =
   fieldDescriptor.getOptions().getExtension(Messages.isPii);
 
  Shouldn't this use some sort of API that doesn't use the Messages class
 at
  all?
 
-ilia
 




[protobuf] Parse a .proto file

2014-10-30 Thread Pradeep Gollakota
Hi Protobuf gurus,

I'm trying to parse a .proto file in Java to use with DynamicMessages. Is 
this possible or does it have to be compiled to a descriptor set file 
first before this can be done?

I have a use case where I need to parse messages without having the 
corresponding precompiled classes in Java. So the DynamicMessage seems to 
be the correct fit, but I'm not sure how I can generate the DescriptorSet 
from the .proto definition.

Thanks in advance,
Pradeep



Re: Upgrading a coprocessor

2014-10-29 Thread Pradeep Gollakota
At Lithium, we power Klout using HBase. We load Klout scores for about 500
million users into HBase every night. When a load is happening, we noticed
that the performance of klout.com was severely degraded. We also see
severely degraded performance when performing operations like compactions.
In order to mitigate this, we stood up two HBase clusters in an
Active/Standby configuration (not the built-in replication, but something
else entirely). We serve data from the Active cluster and load data into
the Standby and then swap, load into the other cluster while serving from
the cluster that just got the update.

We don't use coprocessors, so we didn't have the problem you're describing.
However, in our configuration, what we would do is upgrade the coprocessor
in the Standby and then swap the clusters. But since you would have to
stand up a second HBase cluster, this may be a non-starter for you. Just
another option thrown into the mix. :)
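
For reference, a bare-bones sketch of the jar-path-versioning approach mentioned further down in this thread, written against the 0.94-era client API; the table, class, and jar names are placeholders, and removing the old coprocessor entry from the descriptor is left out.

```
// Sketch only: point the table at a uniquely named coprocessor jar on HDFS,
// then disable/modify/enable so the region servers load the new classes
// (the cached classloader for the old path is never reused).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CoprocessorUpgrade {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] tableName = Bytes.toBytes("my_table");

    HTableDescriptor desc = admin.getTableDescriptor(tableName);
    // A fresh jar path (e.g. with a timestamp or epoch appended).
    desc.addCoprocessor("com.example.MyRegionObserver",
        new Path("hdfs:///coprocessors/my-observer-1414612800.jar"),
        Coprocessor.PRIORITY_USER, null);

    admin.disableTable(tableName);
    admin.modifyTable(tableName, desc);
    admin.enableTable(tableName);
    admin.close();
  }
}
```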

On Wed Oct 29 2014 at 12:07:02 PM Michael Segel mse...@segel.com wrote:

 Well you could redesign your cp.

 There is a way to work around the issue by creating a cp that's really a
 framework and then manage the cps in a different jvm(s) using messaging
 between the two.
 So if you want to reload or restart your cp, you can do it outside of the
 RS.

 Its a bit more work...


 On Oct 29, 2014, at 9:21 AM, Ted Yu yuzhih...@gmail.com wrote:

  Rolling restart of servers may have bigger impact on operations - server
  hosting hbase:meta would be involved which has more impact compared to
  disabling / enabling user table.
 
  You should give ample timeout to your client. The following is an
  incomplete list of configs (you can find their explanation on
  http://hbase.apache.org/book.html):
 
  hbase.client.scanner.timeout.period
  hbase.rpc.timeout
 
  Cheers
 
  On Tue, Oct 28, 2014 at 11:18 PM, Hayden Marchant hayd...@amobee.com
  wrote:
 
  Thanks all for confirming what I thought was happening.
 
  I am considering implementing a pattern similar to Iain's in that I
  version that path of the cp, and disable/enable the table while
 upgrading
  the cp metadata.
 
  However, what are the operational considerations of disabling a table
 for
  a number of seconds, versus rolling restart of region servers? Assuming
  that however hard I try, there still might be a process or 2 that are
  accessing that table at that time. What sort of error handling will I
 need
  to more aware of now (I assume that MapReduce would recover from either
 of
  these two strategies?)
 
  Thanks,
  Hayden
 
  
  From: iain wright iainw...@gmail.com
  Sent: Wednesday, October 29, 2014 1:51 AM
  To: user@hbase.apache.org
  Subject: Re: Upgrading a coprocessor
 
  Hi Hayden,
 
   We ran into the same thing and ended up going with a rudimentary cp deploy
   script for appending an epoch to the cp name, placing it on HDFS, and
   disabling/modifying/enabling the hbase table.

   Here's the issue for this: https://issues.apache.org/jira/browse/HBASE-9046
 
  -
 
  --
  Iain Wright
 
  This email message is confidential, intended only for the recipient(s)
  named above and may contain information that is privileged, exempt from
  disclosure under applicable law. If you are not the intended recipient,
 do
  not disclose or disseminate the message to anyone except the intended
  recipient. If you have received this message in error, or are not the
 named
  recipient(s), please immediately notify the sender by return email, and
  delete all copies of this message.
 
  On Tue, Oct 28, 2014 at 10:51 AM, Bharath Vissapragada 
  bhara...@cloudera.com wrote:
 
  Hi Hayden,
 
  Currently there is no workaround. We can't unload already loaded
 classes
  unless we make changes to Hbase's classloader design and I believe its
  not
  that trivial.
 
  - Bharath
 
  On Tue, Oct 28, 2014 at 2:52 AM, Hayden Marchant hayd...@amobee.com
  wrote:
 
  I have been using a RegionObserver coprocessor on my HBase 0.94.6
  cluster
  for quite a while and it works great. I am currently upgrading the
  functionality. When doing some testing in our integration environment
 I
  met
  with the issue that even when I uploaded a new version of my
  coprocessor
  jar to HDFS, HBase did not recognize it, and it kept using the old
  version.
 
  I even disabled/reenabled the table - no help. Even with a new table,
  it
  still loads old class. Only when I changed the location of the jar in
  HDFS,
  did it load the new version.
 
  I looked at the source code of CoprocessorHost and I see that it is
  forever holding a classloaderCache with no mechanism for clearing it
  out.
 
  I assume that if I restart the region server it will take the new
  version
  of my coprocessor.
 
  Is there any workaround for upgrading a coprocessor without either
  changing the path, or restarting the HBase region server?
 
  Thanks,
  Hayden
 
 
 
 
  --
  Bharath Vissapragada
  http://www.cloudera.com
 
 




Re: Is there a way to indicate that the data is sorted for a group-by operation?

2014-10-13 Thread Pradeep Gollakota
This is a great question!

I could be wrong, but I don't believe there is a way to indicate this for a
group-by. It definitely does matter for performance if your input is
globally sorted. Currently a group by happens on reduce side. But if the
input is globally sorted, this can happen map side for a significant
performance boost.

I did see a CollectableLoadFunc
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/CollectableLoadFunc.html
interface that's used in the MergeJoin algorithm... I don't see why this
couldn't be used for a map side group by also.

On Sun, Oct 12, 2014 at 11:48 PM, Sunil S Nandihalli 
sunil.nandiha...@gmail.com wrote:

 Hi Everybody,
  Is there a way to indicate that the data is sorted by the key using which
 the relations are being grouped? Or does it even matter for performance
 whether we indicate it or not?
 Thanks,
 Sunil.



Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
Hi Ankur,

Is the list of regular expressions static or dynamic? If it's a static
list, you can collapse all the filter operators into a single operator and
use the AND keyword to combine them.

E.g.

 Filtered_Data = FILTER BagName BY ($0 matches 'RegEx-1') AND ($0 matches
'RegEx-2') AND ($0 matches 'RegEx-3');

If it's dynamic, you can use the option that Russell and Prashant
suggested. Write a UDF that loads a list of regular expressions and
processes them in sequence.

On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal ankur.kasliwal...@gmail.com
 wrote:

 Hi,



 I have written a ‘Pig Script’ which is processing Sequence files given as
 input.

 It is working fine but there is one problem mentioned below.



 I have repetitive statements in my pig script,  as shown below:





-  Filtered_Data _1= FILTER BagName BY ($0 matches 'RegEx-1');
-  Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
-  Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
- So on…



 Question :

 So is there any way by which I can have above statement written once and

 then loop through all possible “RegEx” and substitute in Pig script.



 For Example:


 Filtered_Data _X  =   FILTER BagName BY ($0 matches 'RegEx');  ( have this
 statement once )

 ( loop through all possible RegEx and substitute value in the statement )



 Right now I am calling Pig script from a shell script, so any way from
 shell script will be also be welcome..



 Thanks in advance.

 Happy Pigging



Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
In case you haven't seen this already, take a look at
http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies on
optimizing your pig scripts.

On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney russell.jur...@gmail.com
wrote:

 Actually, I don't think you need SelectFieldByValue. Just use the name of
 the field directly.

 On Monday, October 6, 2014, Prashant Kommireddi prash1...@gmail.com
 wrote:

  Are these regex static? If yes, this is easily achieved with embedding
 your
  script in Java or any other language that Pig supports
  http://pig.apache.org/docs/r0.13.0/cont.html
 
  You could also possibly write a UDF that loops through all the regex and
  returns result.
 
 
 
  On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal 
   ankur.kasliwal...@gmail.com
    wrote:
 
   Hi,
  
  
  
   I have written a ‘Pig Script’ which is processing Sequence files given
 as
   input.
  
   It is working fine but there is one problem mentioned below.
  
  
  
   I have repetitive statements in my pig script,  as shown below:
  
  
  
  
  
  -  Filtered_Data _1= FILTER BagName BY ($0 matches 'RegEx-1');
  -  Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
  -  Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
  - So on…
  
  
  
   Question :
  
   So is there any way by which I can have above statement written once
 and
  
   then loop through all possible “RegEx” and substitute in Pig script.
  
  
  
   For Example:
  
  
   Filtered_Data _X  =   FILTER BagName BY ($0 matches 'RegEx');  ( have
  this
   statement once )
  
   ( loop through all possible RegEx and substitute value in the
 statement )
  
  
  
   Right now I am calling Pig script from a shell script, so any way from
   shell script will be also be welcome..
  
  
  
   Thanks in advance.
  
   Happy Pigging
  
 


 --
 Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
 datasyndrome.com



Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
It looks like the best option at this point is to write a custom UDF that
loads a set of regular expressions from a file and runs the data
through all of them.
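
A rough sketch of such a UDF; the class name and the idea of reading the patterns from a local file path passed to the constructor are assumptions for illustration, not something specified in this thread.

```
// Sketch: a Pig FilterFunc that loads regular expressions (one per line) from
// a file given to the constructor and keeps a record if any pattern matches.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class MatchesAnyRegex extends FilterFunc {
  private final List<Pattern> patterns = new ArrayList<Pattern>();

  public MatchesAnyRegex(String regexFile) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(regexFile));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        if (!line.trim().isEmpty()) {
          patterns.add(Pattern.compile(line.trim()));
        }
      }
    } finally {
      in.close();
    }
  }

  @Override
  public Boolean exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return false;
    }
    String value = input.get(0).toString();
    for (Pattern p : patterns) {
      if (p.matcher(value).matches()) {
        return true;
      }
    }
    return false;
  }
}
```

It could then be wired up with a DEFINE and used in a single FILTER, or adapted to return which pattern matched so the result can be SPLIT into separate relations.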

On Mon, Oct 6, 2014 at 1:44 PM, Ankur Kasliwal ankur.kasliwal...@gmail.com
wrote:

 Thanks for replying everyone. Few comments to everyone's suggestion.

 1. I am processing a sequence file which consists of many CSV files. I need
 to extract only a few among all the CSVs. That is the reason I am doing
 'SelectFieldByValue', which is the file name in my case, not the field directly.

 2. All selected files (different RegEx) are stored in HDFS separately,
 so there is one STORE statement for each extracted file in a bag.

 3. Cannot do a cross join as all the file inputs would get combined; do not
 want to do that.

 4. Cannot use the AND/OR operator as I need different bags for each selected
 file (RegEx).



 Let me know if any one has any other suggestions.
 Sorry for not being clear with specification at first place.

 Thanks.

 On Mon, Oct 6, 2014 at 4:12 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 In case you haven't seen this already, take a look at
 http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies on
 optimizing your pig scripts.

 On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney russell.jur...@gmail.com
 wrote:

  Actually, I don't think you need SelectFieldByValue. Just use the name
 of
  the field directly.
 
  On Monday, October 6, 2014, Prashant Kommireddi prash1...@gmail.com
  wrote:
 
   Are these regex static? If yes, this is easily achieved with embedding
  your
   script in Java or any other language that Pig supports
   http://pig.apache.org/docs/r0.13.0/cont.html
  
   You could also possibly write a UDF that loops through all the regex
 and
   returns result.
  
  
  
   On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal 
   ankur.kasliwal...@gmail.com javascript:;
wrote:
  
Hi,
   
   
   
I have written a ‘Pig Script’ which is processing Sequence files
 given
  as
input.
   
It is working fine but there is one problem mentioned below.
   
   
   
I have repetitive statements in my pig script,  as shown below:
   
   
   
   
   
   -  Filtered_Data _1= FILTER BagName BY ($0 matches 'RegEx-1');
   -  Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
   -  Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
   - So on…
   
   
   
Question :
   
So is there any way by which I can have above statement written once
  and
   
then loop through all possible “RegEx” and substitute in Pig script.
   
   
   
For Example:
   
   
Filtered_Data _X  =   FILTER BagName BY ($0 matches 'RegEx');  (
 have
   this
statement once )
   
( loop through all possible RegEx and substitute value in the
  statement )
   
   
   
Right now I am calling Pig script from a shell script, so any way
 from
shell script will be also be welcome..
   
   
   
Thanks in advance.
   
Happy Pigging
   
  
 
 
  --
  Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
  datasyndrome.com
 





Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects

2014-10-06 Thread Pradeep Gollakota
I agree with the answers suggested above.

3. B
4. D
5. C

On Mon, Oct 6, 2014 at 2:58 PM, Ulul had...@ulul.org wrote:

  Hi

 No, Pig is a data manipulation language for data already in Hadoop.
 The question is about importing data from OLTP DB (eg Oracle, MySQL...) to
 Hadoop, this is what Sqoop is for (SQL to Hadoop)

 I'm not certain certification guys are happy with their exam questions
 ending up on blogs and mailing lists :-)

 Ulul

 On 06/10/2014 13:54, unmesha sreeveni wrote:

  what about the last one? The answer is correct. Pig. Isn't it?

 On Mon, Oct 6, 2014 at 4:29 PM, adarsh deshratnam 
 adarsh.deshrat...@gmail.com wrote:

 For question 3 answer should be B and for question 4 answer should be D.

  Thanks,
 Adarsh D

 Consultant - BigData and Cloud

 http://in.linkedin.com/in/adarshdeshratnam

 On Mon, Oct 6, 2014 at 2:25 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

  Hi

  For the 5th question, can it be SQOOP?

 On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

  Yes

 On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar 
 skumar.bigd...@hotmail.com wrote:

  Are you preparing for the Cloudera certification exam?





 Thanks and Regards,

 Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh

 (510) 936-2650

 Sr Data Consultant - BigData Implementations.

 http://www.linkedin.com/in/sinhasantosh







 *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com]
 *Sent:* Monday, October 06, 2014 12:45 AM
 *To:* User - Hive; User Hadoop; User Pig
 *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects




 http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html



 --

 *Thanks  Regards *



 *Unmesha Sreeveni U.B*

 *Hadoop, Bigdata Developer*

 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*

 http://www.unmeshasreeveni.blogspot.in/








  --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B *
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
  http://www.unmeshasreeveni.blogspot.in/





  --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B *
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
  http://www.unmeshasreeveni.blogspot.in/






  --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B *
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
  http://www.unmeshasreeveni.blogspot.in/






Re: datanode down, disk replaced , /etc/fstab changed. Can't bring it back up. Missing lock file?

2014-10-03 Thread Pradeep Gollakota
Looks like you're facing the same problem as this SO.
http://stackoverflow.com/questions/10705140/hadoop-datanode-fails-to-start-throwing-org-apache-hadoop-hdfs-server-common-sto

Try the suggested fix.

On Fri, Oct 3, 2014 at 6:57 PM, Colin Kincaid Williams disc...@uw.edu
wrote:

 We had a datanode go down, and our datacenter guy swapped out the disk. We
 had moved to using UUIDs in the /etc/fstab, but he wanted to use the
 /dev/id format. He didn't backup the fstab, however I'm not sure that's the
 issue.

 I am reading in the log below that the namenode has a lock on the disk? I
 don't know how that works. I thought the lockfile would belong to the
 datanode itself. How do I remove the lock from the namenode to bring the
 datanode back up?

 If that's not the issue, how can I bring the datanode back up? Help would
 be greatly appreciated.




 2014-10-03 18:28:18,121 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = us3sm2hb027r09.comp.prod.local/10.51.28.172
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 2.3.0-cdh5.0.1
 STARTUP_MSG:   classpath =
 

Re: Block placement without rack aware

2014-10-02 Thread Pradeep Gollakota
It appears to be randomly chosen. I just came across this blog post from
Lars George about HBase file locality in HDFS
http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html

On Thu, Oct 2, 2014 at 4:12 PM, SF Hadoop sfhad...@gmail.com wrote:

 What is the block placement policy hadoop follows when rack aware is not
 enabled?

 Does it just round robin?

 Thanks.



Re: Hbase Read/Write poor performance. Please help!

2014-08-13 Thread Pradeep Gollakota
Can you post the client code you're using to read/write from HBase?


On Wed, Aug 13, 2014 at 11:21 AM, kacperolszewski kacperolszew...@o2.pl
wrote:

 Hello there, I'm running a read/write benchmark on huge data (Twitter
 posts) for my school project.
 The problem I'm dealing with is that the tests are going extremely slowly.
 I don't know how to optimize the process. HBase is using only about 10% of
 RAM memory, and 40% of CPU.
 I've been experimenting with properties of the hbase-site.xml file but I can't
 see any results. I think there is something I need to put in the API's code.
 It's a single node, local operation. The virtual machine has 8GB RAM. Any help
 will be appreciated.
 Best Regards - Kacper, student


Re: Json Loader - Array of objects - Loading results in empty data set

2014-08-08 Thread Pradeep Gollakota
I think there's a problem with your schema.

 {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2:
chararray)})}

should probably look like

 {DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2:
chararray))})}


On Thu, Aug 7, 2014 at 11:22 AM, Klüber, Ralf ralf.klue...@p3-group.com
wrote:

 Hello,



 I am new to this list. I tried to solve this problem for the last 48h but
 I am stuck. I hope someone here can hint me in the right direction.



 I have problems using the Pig JsonLoader and wondering if I do something
 wrong or I encounter another problem.



 The 1st half of this post is to show I know at least something about
 what I am talking about and that I did my homework. During research I found a lot
 about elephant-bird, but there seems to be a conflict with Cloudera, so
 I am stuck there as well. If you trust me already you can directly jump to
 the 2nd half of my post ;-).



 The desired solution should work both, in Cloudera and on Amazon EMR.



 To proof something works.

 --



 I have this data file:

 ```

 $ cat a.json


 {DataASet:{A1:1,A2:4,DataBSets:[{B1:1,B2:1},{B1:2,B2:2}]}}

 $ ./jq '.' a.json

 {

   DataASet: {

 A1: 1,

 A2: 4,

 DataBSets: [

   {

 B1: 1,

 B2: 1

   },

   {

 B1: 2,

 B2: 2

   }

 ]

   }

 }

 $

 ```



 I am using this Pig Script to load it.



 ``` Pig

 a = load 'a.json' using JsonLoader('

  DataASet: (

A1:int,

A2:int,

DataBSets: {

 (

B1:chararray,

B2:chararray

  )

}

  )

 ');

 ```



 In grunt everything seems ok.



 ```

 grunt describe a;

 a: {DataASet: (A1: int,A2: int,DataBSets: {(B1: chararray,B2: chararray)})}

 grunt dump a;

 ((1,4,{(1,1),(2,2)}))

 grunt

 ```



 So far so good.



 Real Problem

 



 In fact my real data (Gigabytes) looks a little bit different. The array
 is in fact an array of an object.



 ```

 $ ./jq '.' b.json

 {

   DataASet: {

 A1: 1,

 A2: 4,

 DataBSets: [

   {

 DataBSet: {

   B1: 1,

   B2: 1

 }

   },

   {

 DataBSet: {

   B1: 2,

   B2: 2

 }

   }

 ]

   }

 }

 $ cat b.json


 {DataASet:{A1:1,A2:4,DataBSets:[{DataBSet:{B1:1,B2:1}},{DataBSet:{B1:2,B2:2}}]}}

 $

 ```



 I trying to load this json with the following schema:



 ``` Pig

 b = load 'b.json' using JsonLoader('

  DataASet: (

A1:int,

A2:int,

DataBSets: {

 DataBSet: (

B1:chararray,

B2:chararray

  )

}

  )

 ');

 ```



 Again it looks good so far in grunt.



 ```

 grunt describe b;

 b: {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2:
 chararray)})} ```



 I expect someting like this when dumping b:



 ```

 ((1,4,{((1,1)),((2,2))}))

 ```



 But I get this:



 ```

 grunt dump b;

 ()

 grunt

 ```



 Obviously I am doing something wrong. An empty set hints in the direction
 that the schema does not match on the input line.



 Any hints? Thanks in advance.



 Kind regards.

 Ralf



Re: Json Loader - Array of objects - Loading results in empty data set

2014-08-08 Thread Pradeep Gollakota
I haven't worked with JsonLoader much, so I'm not sure what the problem is.
But your schema looks correct for your JSON structure now.

DataBSets is an Array (or Bag) of Objects (or Tuples). Each Object (or
Tuple) inside the Array has one key which maps to an Object(or Tuple) with
two keys. This is exactly what you would want the structure to look like in
pig.

```
Grunt  describe b;
b: {DataASet: (A1: int,A2: int,DataBSets: {tuple_0: (DataBSet: (B1:
chararray,B2: chararray))})}
grunt dump b;
()
grunt
```

I know that lots of people have been having problems with JsonLoader in the
past. I can recall off-hand several emails over the past year on this
mailing list complaining about the loader. Most of the recommendations,
remembering off the top of my head, have been to use the Elephant bird
version of the Loader.

I'm not sure what version conflict you're seeing with CDH +
elephant-bird, but I'd recommend compiling elephant-bird against the correct
versions of Hadoop + Pig that you're using and deploying it to your Maven repo.
I do this myself so that I know that all the code is compiled against the
correct versions that we're running in house.

I'm going to look into this problem a little bit more and see if I can get
it to work without elephant-bird.


On Fri, Aug 8, 2014 at 8:44 AM, Klüber, Ralf ralf.klue...@p3-group.com
wrote:

 Hello,

 Much appreciated you taking your time to answer.

  should probably look like
 
   {DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2:
  chararray))})}

 How to achieve this? I tried:
 ```
 b = load 'b.json' using JsonLoader('
  DataASet: (
A1:int,
A2:int,
DataBSets: {
 (
 (DataBSet: (
B1:chararray,
B2:chararray
  )
 ))
}
  )
  ');
 ```

 Which gives this schema which does not look right.
 Dump fails (empty bag)

 ```
 Grunt  describe b;
 b: {DataASet: (A1: int,A2: int,DataBSets: {tuple_0: (DataBSet: (B1:
 chararray,B2: chararray))})}
 grunt dump b;
 ()
 grunt
 ```

 Kind regards.
 Ralf

  -----Original Message-----
  From: Pradeep Gollakota [mailto:pradeep...@gmail.com]
  Sent: Friday, August 08, 2014 2:21 PM
  To: user@pig.apache.org
  Subject: Re: Json Loader - Array of objects - Loading results in empty
 data set
 
  I think there's a problem with your schema.
 
   {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2:
  chararray)})}
 
  should probably look like
 
   {DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2:
  chararray))})}
 
 
  On Thu, Aug 7, 2014 at 11:22 AM, Klüber, Ralf ralf.klue...@p3-group.com
 
  wrote:



Rolling upgrades

2014-08-01 Thread Pradeep Gollakota
Hi All,

Is it possible to do a rolling upgrade from Hadoop 2.2 to 2.4?

Thanks,
Pradeep


Kafka Go Client

2014-07-22 Thread Pradeep Gollakota
Hi All,

I was watching the talks from the Kafka meet up at LinkedIn last month.
While answering a question on producers spilling to disk, Neha mentioned
that there was a Go client that had this feature. I was wondering if the
client that does this is https://github.com/Shopify/sarama/issues.

I'm writing a Go application interacting with Kafka and the producers
spilling to disk is a desirable feature for me.

Thanks,
Pradeep


Re: Query On Pig

2014-07-01 Thread Pradeep Gollakota
i. Equals can be mimicked by specifying both <= and >= (i.e. -lte=123
-gte=123)
ii. What do you mean by taking a partial rowkey? The lte and gte are
partial matches.


On Mon, Jun 30, 2014 at 10:24 PM, Nivetha K nivethak3...@gmail.com wrote:

 Hi,


   I am working with Pig.


 I need to know some information on HBaseStorage.



 B = LOAD 'hbase://sample' using
 org.apache.pig.backend.hadoop.hbase.HBaseStorage('details:* details1:*
 details2:*','-loadKey true -lte=123') as
 (id:chararray,m1:map[],m2:map[],m3:map[]);


 (i) Like lte (i.e. less than or equal to), is there any option like equal to?

 (ii) Is it possible to take a partial rowkey?



Re: Query On Pig

2014-07-01 Thread Pradeep Gollakota
i. That's correct.
ii. If the key partial match is at the beginning of the row key, then what
you're looking for is the -gte and -lt/-lte flags. If you want to start
with 123, you just specify -gte 123 -lt 124. This would have the same
effect as a partial starts-with match. If what you're looking for is more
complex, trunk has a new feature of HBaseStorage which adds the -regex
flag. You can use this to do more complex matching.
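
For what it's worth, the same range trick expressed directly against the HBase client API (which is roughly what the -gte/-lt options translate into), using the 'sample' table from the original question:

```
// Illustration of the range trick above: rows starting with "123" are exactly
// the rows in the half-open interval [ "123", "124" ).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixRangeScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "sample");
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("123"));  // inclusive, like -gte 123
    scan.setStopRow(Bytes.toBytes("124"));   // exclusive, like -lt 124
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}
```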


On Mon, Jun 30, 2014 at 11:43 PM, Nivetha K nivethak3...@gmail.com wrote:

 (i) There is no direct way to take exact match

 (ii) Partial row key means

  consider my rowkeys are
 123456,123678,123678,124789,124789.. i need to take the rowkeys
 starts with 123

 On 1 July 2014 11:36, Pradeep Gollakota pradeep...@gmail.com wrote:

  i. Equals can be mimicked by specifying both = and = (i.e. -lte=123
  -gte=123)
  ii. What do you mean by taking a partial rowkey? the lte and gte are
  partial matches.
 
 
  On Mon, Jun 30, 2014 at 10:24 PM, Nivetha K nivethak3...@gmail.com
  wrote:
 
   Hi,
  
  
Iam working with Pig.
  
  
   I need to know some information on HBaseStorage.
  
  
  
   B = LOAD 'hbase://sample' using
   org.apache.pig.backend.hadoop.hbase.HBaseStorage('details:* details1:*
   details2:*','-loadKey true -lte=123') as
   (id:chararray,m1:map[],m2:map[],m3:map[]);
  
  
   (i)   like lte ((ie) Less than Equalto) is there any option like
 equalto
  
   (ii) Is there any possible to take the partial rowkey.
  
 



Re: [DISCUSS] Kafka Security Specific Features

2014-06-06 Thread Pradeep Gollakota
I'm actually not convinced that encryption needs to be handled server side
in Kafka. I think the best solution for encryption is to handle it
producer/consumer side just like compression. This will offload key
management to the users and we'll still be able to leverage the sendfile
optimization for better performance.
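
As a sketch of what client-side payload encryption could look like (plain javax.crypto, nothing Kafka-specific; key management and the actual produce/consume calls are left out):

```
// Sketch: encrypt the payload bytes before handing them to the producer and
// decrypt them after the consumer fetches them, analogous to client-side
// compression. AES/GCM gives confidentiality plus an integrity check.
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public final class PayloadCrypto {
  private static final int IV_BYTES = 12;
  private static final int TAG_BITS = 128;
  private static final SecureRandom RANDOM = new SecureRandom();

  // Returns IV followed by ciphertext; the consumer splits them back apart.
  public static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
    byte[] iv = new byte[IV_BYTES];
    RANDOM.nextBytes(iv);
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
    byte[] ciphertext = cipher.doFinal(plaintext);
    byte[] out = new byte[iv.length + ciphertext.length];
    System.arraycopy(iv, 0, out, 0, iv.length);
    System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
    return out;
  }

  public static byte[] decrypt(SecretKey key, byte[] payload) throws Exception {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.DECRYPT_MODE, key,
        new GCMParameterSpec(TAG_BITS, payload, 0, IV_BYTES));
    return cipher.doFinal(payload, IV_BYTES, payload.length - IV_BYTES);
  }
}
```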


On Fri, Jun 6, 2014 at 10:48 AM, Rob Withers robert.w.with...@gmail.com
wrote:

 On consideration, if we have 3 different access groups (1 for production
 WRITE and 2 consumers) they all need to decode the same encryption and so
 all need the same public/private key; certs won't work, unless you write
 a CertAuthority to build multiple certs with the same keys.  Better seems
 to not use certs and to wrap the encryption specification with ACL
 capabilities for each group of access.


 On Jun 6, 2014, at 11:43 AM, Rob Withers wrote:

  This is quite interesting to me and it is an excellent opportunity to
 promote a slightly different security scheme.  Object-capabilities are
 perfect for online security and would use ACL style authentication to gain
 capabilities filtered to those allowed resources for allowed actions
 (READ/WRITE/DELETE/LIST/SCAN).  Erights.org has the quintessential
 object capabilities model and capnproto is implementing this for C++.  I
 have a java implementation at http://github.com/pauwau/pauwau but the
 master is broken.  0.2 works, basically.  Basically a TLS connection with
 no certificate server, it is peer to peer.  It has some advanced features,
 but the linking of capabilities with authorization so that you can only
 invoke correct services is extended to the secure user.

 Regarding non-repudiation, on disk, why not prepend a CRC?

 Regarding on-disk encryption, multiple users/groups may need to access,
 with different capabilities.  Sounds like zookeeper needs to store a cert
 for each class of access so that a group member can access the decrypted
 data from disk.  Use cert-based async decryption.  The only issue is storing
 the private key in zookeeper.  Perhaps some hash magic could be used.

 Thanks for kafka,
 Rob

 On Jun 5, 2014, at 3:01 PM, Jay Kreps wrote:

  Hey Joe,

 I don't really understand the sections you added to the wiki. Can you
 clarify them?

 Is non-repudiation what SASL would call integrity checks? If so don't SSL
 and and many of the SASL schemes already support this as well as
 on-the-wire encryption?

 Or are you proposing an on-disk encryption scheme? Is this actually
 needed?
 Isn't a on-the-wire encryption when combined with mutual authentication
 and
 permissions sufficient for most uses?

 On-disk encryption seems unnecessary because if an attacker can get root
 on
 the kafka boxes it can potentially modify Kafka to do anything he or she
 wants with data. So this seems to break any security model.

 I understand the problem of a large organization not really having a
 trusted network and wanting to secure data transfer and limit and audit
 data access. The uses for these other things I don't totally understand.

 Also it would be worth understanding the state of other messaging and
 storage systems (Hadoop, dbs, etc). What features do they support. I
 think
 there is a sense in which you don't have to run faster than the bear, but
 only faster than your friends. :-)

 -Jay


 On Wed, Jun 4, 2014 at 5:57 PM, Joe Stein joe.st...@stealth.ly wrote:

  I like the idea of working on the spec and prioritizing. I will update
 the
 wiki.

 - Joestein


 On Wed, Jun 4, 2014 at 1:11 PM, Jay Kreps jay.kr...@gmail.com wrote:

  Hey Joe,

 Thanks for kicking this discussion off! I totally agree that for

 something

 that acts as a central message broker security is critical feature. I

 think

 a number of people have been interested in this topic and several
 people
 have put effort into special purpose security efforts.

 Since most the LinkedIn folks are working on the consumer right now I

 think

 this would be a great project for any other interested people to take
 on.
 There are some challenges in doing these things distributed but it can

 also

 be a lot of fun.

 I think a good first step would be to get a written plan we can all
 agree
 on for how things should work. Then we can break things down into
 chunks
 that can be done independently while still aiming at a good end state.

 I had tried to write up some notes that summarized at least the
 thoughts

 I

 had had on security:
 https://cwiki.apache.org/confluence/display/KAFKA/Security

 What do you think of that?

 One assumption I had (which may be incorrect) is that although we want

 all

 the things in your list, the two most pressing would be authentication

 and

 authorization, and that was all that write up covered. You have more
 experience in this domain, so I wonder how you would prioritize?

 Those notes are really sketchy, so I think the first goal I would have
 would be to get to a real spec we can all agree on and discuss. A lot
 of
 the security stuff has a 

Re: Ambari with Druid

2014-06-05 Thread Pradeep Gollakota
https://cwiki.apache.org/confluence/display/AMBARI/Stacks+and+Services


On Wed, Jun 4, 2014 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Custom stack? Is this a new feature with 1.6.0?
 Is this exposed from the Web UI? Or is this only a REST interface?


 On Wed, Jun 4, 2014 at 10:00 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 Ambari has a concept of custom stacks. So, you can write a custom stack
 to deploy Druid. At installation time, you can choose to install your Druid
 stack but not the Hadoop stack.


 On Wed, Jun 4, 2014 at 9:21 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Hello
 I really like the way ambari can be used to setup cluster effortlessly.
 I would like to use this benefits

 1. Druid is a distributed columnar store (druid.io)
 2. Druid has several node types like, historical, realtime, broker and
 coordinator.
 3. Each of these can be started by unpacking a tar and start the java
 class with Java 1.7. Its that simple.
 4. Now i would like to use ambari to install druid on these various
 nodes, just like we install hadoop on various nodes and then assign various
 service to each of them.

 How easy or difficult is to remove hadoop* out of ambari and replace it
 with druid* or any other installation?

 What do you think ? Can ambari be made that generic ?

 1. Ambari allows you to install hadoop on a bunch of nodes and provide
 monitoring like (service, cpu, disk, io)
 2. Can ambari be made hadoop agnostic ? or generic enough that any
 cluster could be installed with it ?
  --
 Deepak





 --
 Deepak




[jira] [Commented] (AMBARI-5707) Replace Ganglia with high performant and pluggable Metrics System

2014-06-04 Thread Pradeep Gollakota (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017422#comment-14017422
 ] 

Pradeep Gollakota commented on AMBARI-5707:
---

I too agree that it may not be the best idea for Ambari to rebuild components 
of the stack. However, I would like to see a pluggable architecture for 
Metrics. For example, we use Datadog heavily, so it would be great if Ambari 
could plug into our existing metrics infrastructure and pull graphs directly 
from Datadog.

 Replace Ganglia with high performant and pluggable Metrics System
 -

 Key: AMBARI-5707
 URL: https://issues.apache.org/jira/browse/AMBARI-5707
 Project: Ambari
  Issue Type: New Feature
  Components: agent, controller
Affects Versions: 1.6.0
Reporter: Siddharth Wagle
Assignee: Siddharth Wagle
Priority: Critical
 Attachments: MetricsSystemArch.png


 Ambari Metrics System
 - Ability to collect metrics from Hadoop and other Stack services
 - Ability to retain metrics at a high precision for a configurable time 
 period (say 5 days)
 - Ability to automatically purge metrics after retention period
 - At collection time, provide clear integration point for external system 
 (such as TSDB)
 - At purge time, provide clear integration point for metrics retention by 
 external system
 - Should provide default options for external metrics retention (say “HDFS”)
 - Provide tools / utilities for analyzing metrics in retention system (say 
 “Hive schema, Pig scripts, etc” that can be used with the default retention 
 store “HDFS”)
 System Requirements
 - Must be portable and platform independent
 - Must not conflict with any existing metrics system (such as Ganglia)
 - Must not conflict with existing SNMP infra
 - Must not run as root
 - Must have HA story (no SPOF)
 Usage
 - Ability to obtain metrics from Ambari REST API (point in time and temporal)
 - Ability to view metric graphs in Ambari Web (currently, fixed)
 - Ability to configure custom metric graphs in Ambari Web (currently, we have 
 metric graphs “fixed” into the UI)
 - Need to improve metric graph “navigation” in Ambari Web (currently, metric 
 graphs do not allow navigation at arbitrary timeframes, but only at ganglia 
 aggregation intervals) 
 - Ability to “view cluster” at point in time (i.e. see all metrics at that 
 point)
 - Ability to define metrics (and how + where to obtain) in Stack Definitions



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address), cust_email;



On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe drah...@googlemail.com wrote:

 Hi All,

 I have imported a hive table into pig having a complex data type
 (ARRAY<String>). The alias in pig looks as below

 grunt describe A;
 A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
 (innerfield: chararray)},cust_email: chararray}

 grunt dump A;

 (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},t...@gmail.com)
 (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)

 The cust_address is the ARRAY field from hive. I want to FLATTEN the
 cust_address into different fields.


 Expected output
 (2200,benjamin avenue,philadelphia)
 (44,atlanta franklin,florida)

 please help

 Regards,
 Rahul



Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
Disregard last email.

Sorry... didn't fully understand the question.


On Mon, Jun 2, 2014 at 8:44 AM, Pradeep Gollakota pradeep...@gmail.com
wrote:

 FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address), cust_email;



 On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe drah...@googlemail.com
 wrote:

 Hi All,

 I have imported hive table into pig having a complex data type
 (ARRAYString). The alias in pig looks as below

 grunt describe A;
 A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
 (innerfield: chararray)},cust_email: chararray}

 grunt dump A;

 (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},t...@gmail.com)
 (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)

 The cust_address is the ARRAY field from hive. I want to FLATTEN the
 cust_address into different fields.


 Expected output
 (2200,benjamin avenue,philadelphia)
 (44,atlanta franklin,florida)

 please help

 Regards,
 Rahul





Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
There was a similar question as this on StackOverflow a while back. The
suggestion was to write a custom BagToTuple UDF.

http://stackoverflow.com/questions/18544602/how-to-flatten-a-group-into-a-single-tuple-in-pig
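
For reference, the gist of such a UDF is small; this is just a sketch, and recent Pig versions ship a builtin BagToTuple (which is what gets used later in this thread):

```
// Sketch: flatten every field of every tuple in a bag into one output tuple.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class BagToTupleSketch extends EvalFunc<Tuple> {
  private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();

  @Override
  public Tuple exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    DataBag bag = (DataBag) input.get(0);
    Tuple out = TUPLE_FACTORY.newTuple();
    for (Tuple t : bag) {
      for (Object field : t.getAll()) {
        out.append(field);
      }
    }
    return out;
  }
}
```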


On Mon, Jun 2, 2014 at 8:46 AM, Pradeep Gollakota pradeep...@gmail.com
wrote:

 Disregard last email.

 Sorry... didn't fully understand the question.


 On Mon, Jun 2, 2014 at 8:44 AM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address), cust_email;



 On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe drah...@googlemail.com
 wrote:

 Hi All,

 I have imported hive table into pig having a complex data type
 (ARRAYString). The alias in pig looks as below

 grunt describe A;
 A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
 (innerfield: chararray)},cust_email: chararray}

 grunt dump A;

 (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},t...@gmail.com)
 (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)

 The cust_address is the ARRAY field from hive. I want to FLATTEN the
 cust_address into different fields.


 Expected output
 (2200,benjamin avenue,philadelphia)
 (44,atlanta franklin,florida)

 please help

 Regards,
 Rahul






Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
If you're using the built-in BagToTuple UDF, then you probably don't need
the FLATTEN operator.

I suspect that your output looks as follows:

2200
benjamin avenue
philadelphia
...

Can you confirm that this is what you're seeing?


On Mon, Jun 2, 2014 at 9:52 AM, Rahul Channe drah...@googlemail.com wrote:

 Thank you Pradeep, it worked to a certain extent but I'm having the following
 difficulty in separating the fields as $0, $1 for the customer_address.


 Example -

 grunt describe A;
 A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
 (innerfield: chararray)},cust_email: chararray}

 grunt dump A;

 (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},t...@gmail.com)
 (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)

 grunt B = foreach A generate FLATTEN(BagToTuple(cust_address));
 grunt dump B;
 (2200,benjamin franklin,philadelphia)
 (44,atlanta franklin,florida)

 grunt describe B;
 B: {org.apache.pig.builtin.bagtotuple_cust_address_34::innerfield:
 chararray}



  I am not able to separate the fields in B as $0, $1 and $3; I tried using
  STRSPLIT but it didn't work.



 On Mon, Jun 2, 2014 at 11:50 AM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

  There was a similar question as this on StackOverflow a while back. The
  suggestion was to write a custom BagToTuple UDF.
 
 
 
 http://stackoverflow.com/questions/18544602/how-to-flatten-a-group-into-a-single-tuple-in-pig
 
 
  On Mon, Jun 2, 2014 at 8:46 AM, Pradeep Gollakota pradeep...@gmail.com
  wrote:
 
   Disregard last email.
  
   Sorry... didn't fully understand the question.
  
  
   On Mon, Jun 2, 2014 at 8:44 AM, Pradeep Gollakota 
 pradeep...@gmail.com
   wrote:
  
   FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address),
  cust_email;
  
  
  
   On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe drah...@googlemail.com
   wrote:
  
   Hi All,
  
   I have imported hive table into pig having a complex data type
   (ARRAYString). The alias in pig looks as below
  
   grunt describe A;
   A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
   (innerfield: chararray)},cust_email: chararray}
  
   grunt dump A;
  
   (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},
 t...@gmail.com
  )
   (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)
  
   The cust_address is the ARRAY field from hive. I want to FLATTEN the
   cust_address into different fields.
  
  
   Expected output
   (2200,benjamin avenue,philadelphia)
   (44,atlanta franklin,florida)
  
   please help
  
   Regards,
   Rahul
  
  
  
  
 



Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
Awesome... that's the way I would have done it as well.


On Mon, Jun 2, 2014 at 10:14 AM, Rahul Channe drah...@googlemail.com
wrote:

 I tried changing the hive column datatype from ARRAY to STRUCT for
 cust_address, then i imported the table in pig.

 Now I am able to separate the fields, as below

 grunt Z = load 'cust_info' using org.apache.hcatalog.pig.HCatLoader();
 grunt describe Z;
 Z: {cust_id: int,cust_name: chararray,cust_address: (house_no: int,street:
 chararray,city: chararray)}


 grunt Y = foreach Z generate cust_address.house_no as
 house_no,cust_address.street as street,UPPER(cust_address.city) as city;
 grunt describe Y;
 Y: {house_no: int,street: chararray,city: chararray}

 grunt dump Y;
 (2200,benjamin franklin,PHILADELPHIA)
 (44,atlanta franklin,FLORIDA)


 On Mon, Jun 2, 2014 at 1:09 PM, Rahul Channe drah...@googlemail.com
 wrote:

  grunt B = foreach A generate BagToTuple(cust_address);
 
  grunt describe B;
  B: {org.apache.pig.builtin.bagtotuple_cust_address_24: (innerfield:
  chararray)}
 
  grunt dump B;
  ((2200,benjamin franklin,philadelphia))
  ((44,atlanta franklin,florida))
 
 
 
 
  On Mon, Jun 2, 2014 at 12:59 PM, Pradeep Gollakota pradeep...@gmail.com
 
  wrote:
 
  If you're using the built-in BagToTuple UDF, then you probably don't
 need
  the FLATTEN operator.
 
  I suspect that your output looks as follows:
 
  2200
  benjamin avenue
  philadelphia
  ...
 
  Can you confirm that this is what you're seeing?
 
 
  On Mon, Jun 2, 2014 at 9:52 AM, Rahul Channe drah...@googlemail.com
  wrote:
 
   Thank You Pradeep, it worked to a certain extend but having following
   difficulty in separating fields as $0,$1 for the customer_address.
  
  
   Example -
  
   grunt describe A;
   A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
   (innerfield: chararray)},cust_email: chararray}
  
   grunt dump A;
  
   (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},
 t...@gmail.com)
   (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)
  
   grunt B = foreach A generate FLATTEN(BagToTuple(cust_address));
   grunt dump B;
   (2200,benjamin franklin,philadelphia)
   (44,atlanta franklin,florida)
  
   grunt describe B;
   B: {org.apache.pig.builtin.bagtotuple_cust_address_34::innerfield:
   chararray}
  
  
  
   I am not able to seperate the fields in B as $0,$1 and $3 ,tried using
   STRSPLIT but didnt work.
  
  
  
   On Mon, Jun 2, 2014 at 11:50 AM, Pradeep Gollakota 
  pradeep...@gmail.com
   wrote:
  
There was a similar question as this on StackOverflow a while back.
  The
suggestion was to write a custom BagToTuple UDF.
   
   
   
  
 
 http://stackoverflow.com/questions/18544602/how-to-flatten-a-group-into-a-single-tuple-in-pig
   
   
On Mon, Jun 2, 2014 at 8:46 AM, Pradeep Gollakota 
  pradeep...@gmail.com
wrote:
   
 Disregard last email.

 Sorry... didn't fully understand the question.


 On Mon, Jun 2, 2014 at 8:44 AM, Pradeep Gollakota 
   pradeep...@gmail.com
 wrote:

 FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address),
cust_email;



 On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe 
  drah...@googlemail.com
 wrote:

 Hi All,

 I have imported hive table into pig having a complex data type
 (ARRAYString). The alias in pig looks as below

 grunt describe A;
 A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
 (innerfield: chararray)},cust_email: chararray}

 grunt dump A;

 (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},
   t...@gmail.com
)
 (124,diego arty,{(44),(atlanta franklin),(florida)},
  o...@gmail.com)

 The cust_address is the ARRAY field from hive. I want to FLATTEN
  the
 cust_address into different fields.


 Expected output
 (2200,benjamin avenue,philadelphia)
 (44,atlanta franklin,florida)

 please help

 Regards,
 Rahul




   
  
 
 
 



Re: Upgrade from Hbase 0.94.6 to 0.96 (From CDH 4.5 - CDH 5.0)

2014-06-02 Thread Pradeep Gollakota
Hortonworks has written a bridge tool to help with this. As far as I know,
this will only work for replicating from a 0.94 cluster to a 0.96 cluster.

Check out https://github.com/hortonworks/HBaseReplicationBridgeServer


On Mon, Jun 2, 2014 at 7:35 AM, yanivG yaniv.yancov...@gmail.com wrote:

 Hi,
 I am trying to upgrade from CDH4.5 (which contains Hbase 0.94.6) to CDH5.0
 (which contains hbase 0.96).
 From Cloudera documentation I found that:
 Rolling upgrades from CDH 4 to CDH 5 are not possible because existing CDH
 4 HBase clients cannot make requests to CDH 5 servers and CDH 5 HBase
 clients cannot make requests to CDH 4 servers. Replication between CDH 4
 and
 CDH 5 is not currently supported.

 We have 2 clusters in production using replication. The question is how to
 perform the upgrade on live system, if replication isn't possible from 4.5
 to 5.0.
 Lets assume I will upgrade the non primary cluster to 5.0, while the live
 cluster will remain the one with 4.5. How can I catch up the data
 afterwards?
 Does CopyTable works between 4.5 to 5.0? Is there any compatible tool to do
 the migration?
 Yaniv



 --
 View this message in context:
 http://apache-hbase.679495.n3.nabble.com/Upgrade-from-Hbase-0-94-6-to-0-96-From-CDH45-CDH-5-0-tp4059931.html
 Sent from the HBase User mailing list archive at Nabble.com.



Re: How to sample an inner bag?

2014-05-27 Thread Pradeep Gollakota
@Mehmet... great hack! I like it :-P


On Tue, May 27, 2014 at 5:08 PM, Mehmet Tepedelenlioglu 
mehmets...@yahoo.com wrote:

 If you know how many items you want from each inner bag exactly, you can
 hack it like this:

 x = foreach x {
 y = foreach x generate RANDOM() as rnd, *;
 y = order y by rnd;
 y = limit y $SAMPLE_NUM;
 y = foreach y generate $1 ..;
 generate group, y;
 }

 Basically randomize the inner bag, sort it wrt the random number and limit
 it to the sample size you want. No reducers needed.
 If the inner bags are huge, ordering will obviously be expensive. If you
 don’t like this, you might have to write your own udf.

 Mehmet

 On May 27, 2014, at 10:03 AM, william.dowl...@thomsonreuters.com 
 william.dowl...@thomsonreuters.com wrote:

  Hi Pig users,
 
  Is there an easy/efficient way to sample an inner bag? For example, with
 input in a relation like
 
  (id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
  (id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
  (id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
 
  I’d like to sample 1/3 the elements of the bags, and get something like
 (ignoring the non-determinism)
  (id1,att1,{(x,0.999749968742)})
  (id1,att2,{(b,0.04)})
  (id2,att1,{(b,0.05)})
 
  I have a circumlocution that seems to work using flatten+ group but that
 looks ugly to me:
 
  tfidf1 = load '$tfidf' as (id: chararray,
   att: chararray,
   pairs: {pair: (word: chararray, value:
 double)});
 
  flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
  sample_flat_tfidf = sample flat_tfidf 0.33;
  tfidf2 = group sample_flat_tfidf by (id, att);
 
  tfidf = foreach tfidf2 {
pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
generate group.id, group.att, pairs;
  };
 
  Can someone suggest a better way to do this?  Many thanks!
 
  William F Dowling
  Senior Technologist
 
  Thomson Reuters
 
 
 



