Podling Pinot Report Reminder - August 2020

2020-07-23 Thread jmclean
Dear podling,

This email was sent by an automated system on behalf of the Apache
Incubator PMC. It is an initial reminder to give you plenty of time to
prepare your quarterly board report.

The board meeting is scheduled for Wed, 19 August 2020.
The report for your podling will form a part of the Incubator PMC
report. The Incubator PMC requires your report to be submitted 2 weeks
before the board meeting, to allow sufficient time for review and
submission (Wed, August 05).

Please submit your report with sufficient time to allow the Incubator
PMC, and subsequently board members, to review and digest it. Again, the
very latest you should submit your report is 2 weeks prior to the board
meeting.

Candidate names should not be made public before people are actually
elected, so please do not include the names of potential committers or
PPMC members in your report.

Thanks,

The Apache Incubator PMC

Submitting your Report
---

Your report should contain the following:

*   Your project name
*   A brief description of your project, which assumes no knowledge of
    the project or necessarily of its field
*   A list of the three most important issues to address in the move
    towards graduation
*   Any issues that the Incubator PMC or ASF Board might wish/need to be
    aware of
*   How the community has developed since the last report
*   How the project has developed since the last report
*   How the podling rates its own maturity

This should be appended to the Incubator Wiki page at:

https://cwiki.apache.org/confluence/display/INCUBATOR/August2020

Note: This is manually populated. You may need to wait a little before
this page is created from a template.

Note: The format of the report has changed to use markdown.

Mentors
---

Mentors should review reports for their project(s) and sign them off on
the Incubator wiki page. Signing off reports shows that you are
following the project - projects that are not signed may raise alarms
for the Incubator PMC.

Incubator PMC




Multiple segments with bin/pinot-admin.sh LaunchDataIngestionJob

2020-07-23 Thread katneni ravikiran
Hi,

I am trying to find information on how segmentation works on the data
ingested using "bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile ".

I have used the basic table configuration with no indexes defined, and the
segmentation section contains only:

"segmentsConfig" : { "replication" : "2", "schemaName" : "Customer" }

When I use this command to upload 7 GB of data, it creates only a
single segment with 52 million rows.

Can anyone help me find the configuration to specify the segmentation
column, or any other configuration by which I can control the number of
segments?

Thanks for your help in advance.

Ravikiran


Re: Multiple segments with bin/pinot-admin.sh LaunchDataIngestionJob

2020-07-23 Thread Neha Pawar
Hi Ravikiran,

There is no config to control the number of segments, but you can control
it by splitting your input file into multiple files. The data ingestion
job generates one segment per input file in your input folder.
I have added this question to the FAQs:
https://docs.pinot.apache.org/basics/getting-started/frequent-questions#how-to-control-number-of-segments-generated
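
As a rough illustration of the splitting step (this is not part of Pinot),
here is a minimal Python sketch. It assumes the input is a single CSV file
with a header row. The function name `split_csv`, the `rows_per_file`
parameter, and the `_partNNNN` file naming are hypothetical choices made for
this example only.

```python
import csv
from pathlib import Path


def split_csv(input_path, output_dir, rows_per_file):
    """Split one CSV file into smaller files, repeating the header in each.

    Hypothetical helper for illustration; returns the paths written, in order.
    """
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(input_path).stem
    written = []
    out_file = None
    writer = None
    rows_in_current = 0
    with open(input_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return written  # empty input: nothing to split
        for row in reader:
            # Start a new output file when none is open or the current one is full.
            if writer is None or rows_in_current >= rows_per_file:
                if out_file is not None:
                    out_file.close()
                part_path = output_dir / f"{stem}_part{len(written):04d}.csv"
                out_file = open(part_path, "w", newline="")
                writer = csv.writer(out_file)
                writer.writerow(header)
                written.append(part_path)
                rows_in_current = 0
            writer.writerow(row)
            rows_in_current += 1
    if out_file is not None:
        out_file.close()
    return written
```

Pointing the ingestion job at the folder that holds the split files, instead
of at the single large file, should then yield one segment per file, as
described above.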

Thanks,
Neha Pawar






Apache Pinot Daily Email Digest (2020-07-23)

2020-07-23 Thread Pinot Slack Email Digest
#general

@ssmgood: @ssmgood has joined the channel

@blcksrx: Hey guys! I'm honored to announce that *Apache Pinot* can now
be used with the *PYTHON* SQL DB-API! Please check this out:

@damianoporta: Hello everybody! I finally got my small cluster up and
running, thank you all for the support! :slightly_smiling_face: I am doing
a final test to understand whether I need to add one more node. However,
just to make one thing a little clearer, I would like to know if we can
"organize" data inside a Pinot Server by a specific column. For those of
you who know Citus, I am referring to the distribution key for shards.
Basically, what I am asking is: if we have a specific column that is often
used in a group-by clause, how can we store documents that have the same
value in that column on the same server? I think it is an important thing
because, for example, in my custom aggregation function I need to sort the
documents of each segment (in `aggregateGroupBySV()`) before working on
them (I am trying to do something similar to what *window functions* do).
I know that a Server has multiple segments and that the document order
within segments could be random, BUT if I have all the documents for that
specific key on the same server, I could avoid sorting everything again in
`extractFinalResult()`, which is called at the Broker level. I know there
is a `merge()` method used to merge the results of each segment; if I can
do something after that MERGE, I can shift all the computation to the
Server level instead of the Broker. I think that matters, because otherwise
the Broker has to work with all the results from each Server and then
sort and compute (in my case).

@axitkhurana: @axitkhurana has joined the channel

@mailtobuchi: Simple question: does the broker send the list of segments
to be queried to the Server along with the query? I think not, but I want
to double check.

#random

@ssmgood: @ssmgood has joined the channel

@axitkhurana: @axitkhurana has joined the channel

#troubleshooting

@ronak: @ronak has joined the channel

@yash.agarwal: What is the correct schema for a date column? I am using
the following:
```
{
  "name": "sls_d",
  "dataType": "STRING",
  "format": "1:DAYS:SIMPLE_DATE_FORMAT:-MM-dd",
  "granularity": "1:DAYS"
}
```
but I am getting:
```
Caused by: java.lang.IllegalArgumentException: Invalid format: "null"
at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187) ~[pinot-all.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
```

#pinot-dev

@axitkhurana: @axitkhurana has joined the channel

@jlli: Hi @npawar, I have a PR to fix the issue of incorrectly fetching the
value of a multi-value column. Could you review it?

@npawar: @jlli, is it possible to fix the avro files, such that the field
you're interested in is not in a GenericRecord?

#presto-pinot-streaming

@jackie.jxt: @elon.azoulay Did you get a chance to address the comments in
the PR?

@jackie.jxt: Can you please give me push access to your fork branch so
that I can also work on it?

@elon.azoulay: Will be pushing shortly, sure!

@g.kishore: I got it to compile

@g.kishore: it was just IntelliJ acting weird