[jira] [Commented] (SOLR-9562) Minimize queried collections for time series alias

2016-10-06 Thread Eungsop Yoo (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15554054#comment-15554054
 ] 

Eungsop Yoo commented on SOLR-9562:
---

I will come back to this before long. Would it be better to open a new issue 
for the *time series router* instead of continuing here?

> Minimize queried collections for time series alias
> --
>
> Key: SOLR-9562
> URL: https://issues.apache.org/jira/browse/SOLR-9562
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Eungsop Yoo
>Priority: Minor
> Attachments: SOLR-9562-v2.patch, SOLR-9562.patch
>
>
> When indexing time series data (such as large volumes of log data), we can 
> create a new collection at a regular interval (hourly, daily, etc.) behind a 
> write alias, and create a read alias that covers all of those collections. 
> However, every collection in the read alias is queried even when the search 
> covers a very narrow time window. In that case the matching docs may be 
> stored in only a small fraction of the collections, so there is no need to 
> query the rest.
> I suggest this patch, which lets a read alias minimize the collections it 
> queries. It adds three parameters to the CREATEALIAS action.
> || Key || Type || Required || Default || Description ||
> | timeField | string | No | | The name of the time field for the time series data. It should be a date type. |
> | dateTimeFormat | string | No | | The timestamp format used when creating collections. Every collection name should have a suffix (starting with "_") in this format. Ex. dateTimeFormat: MMdd, collectionName: col_20160927. See [DateTimeFormatter|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]. |
> | timeZone | string | No | | The time zone for the dateTimeFormat parameter. Ex. GMT+9. See [DateTimeFormatter|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]. |
> Then, when we query with a filter query such as "timeField:\[fromTime TO 
> toTime\]", only the collections that hold docs for the given time range are 
> queried.
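
The selection step described in the issue can be sketched roughly as follows. This is not the code from the attached patch. It is a minimal illustration that assumes one collection per day, an eight-digit date suffix (the Python format "%Y%m%d" stands in for an assumed Java pattern of "yyyyMMdd"), and the GMT+9 zone from the table. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumptions for illustration only (none of this comes from the patch):
SUFFIX_FORMAT = "%Y%m%d"                    # stand-in for an assumed "yyyyMMdd"
ALIAS_ZONE = timezone(timedelta(hours=9))   # the GMT+9 example from the table
INTERVAL = timedelta(days=1)                # one collection per day


def collection_start(name):
    """Parse the text after the last "_" as the collection's start time."""
    suffix = name.rsplit("_", 1)[1]
    return datetime.strptime(suffix, SUFFIX_FORMAT).replace(tzinfo=ALIAS_ZONE)


def collections_for_range(collections, from_time, to_time):
    """Keep collections whose [start, start + INTERVAL) overlaps the range."""
    selected = []
    for name in collections:
        start = collection_start(name)
        if start <= to_time and from_time < start + INTERVAL:
            selected.append(name)
    return selected


alias = ["col_20160925", "col_20160926", "col_20160927"]
window_start = datetime(2016, 9, 27, 10, 0, tzinfo=ALIAS_ZONE)
window_end = datetime(2016, 9, 27, 10, 30, tzinfo=ALIAS_ZONE)
print(collections_for_range(alias, window_start, window_end))
# prints ['col_20160927']
```

With a 30-minute window, only one of the three collections in the alias needs to be queried. A window that crosses midnight would select two.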



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9562) Minimize queried collections for time series alias

2016-10-06 Thread Eungsop Yoo (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15554034#comment-15554034
 ] 

Eungsop Yoo commented on SOLR-9562:
---

Yes, AutoAddReplicas requires a shared file system. My cluster actually runs 
on HDFS (replication factor 3) with 1 replica and AutoAddReplicas enabled. The 
AutoAddReplicas feature works only moderately well. At first there was a bug 
that lost docs during 
failover ([SOLR-9236|https://issues.apache.org/jira/browse/SOLR-9236]), but it 
is fixed now. One problem remains: failover takes a very long time, and 
transaction log replay in particular takes longer than I expected. 
So I now keep the tlogs as small as possible.




[jira] [Commented] (SOLR-9562) Minimize queried collections for time series alias

2016-10-06 Thread Eungsop Yoo (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15554031#comment-15554031
 ] 

Eungsop Yoo commented on SOLR-9562:
---

I run a SolrCloud cluster for log indexing with a daily collection that has 
only 1 replica and multiple shards. (The collections are stored in HDFS with a 
replication factor of 3, and the autoAddReplicas feature is enabled.) In my use 
case query performance doesn't matter, so 1 replica is enough, and indexing 
performance for the given system resources is best with 1 replica. In some 
other use cases, though, your idea would make sense. 
Deleting TTL-expired 
documents ([SOLR-5795|https://issues.apache.org/jira/browse/SOLR-5795]) is not 
efficient for large log data, so a crontab job creates a new daily collection 
and deletes an old one every morning. We need to find a smarter way to maintain 
collections or shards of time series data.
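
A daily maintenance job of the kind described above might look like the sketch below. It only builds the Collections API URLs (CREATE, CREATEALIAS, DELETE) and sends nothing. The host, the "log" prefix, the alias names, the shard count and the 10-day retention are assumptions for illustration, not details of the cluster described here.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Placeholder values, assumed for illustration only.
BASE = "http://localhost:8983/solr/admin/collections"
PREFIX = "log"
RETENTION_DAYS = 10


def name_for(day):
    """Collection name with a date suffix, e.g. log_20161006."""
    return f"{PREFIX}_{day:%Y%m%d}"


def daily_plan(today, retention_days=RETENTION_DAYS, num_shards=4):
    """Return the Collections API URLs a morning cron job could call."""
    kept = [name_for(today - timedelta(days=n)) for n in range(retention_days)]
    expired = name_for(today - timedelta(days=retention_days))
    calls = [
        # Create today's collection with a single replica per shard.
        {"action": "CREATE", "name": kept[0],
         "numShards": num_shards, "replicationFactor": 1},
        # Point the write alias at the newest collection only.
        {"action": "CREATEALIAS", "name": f"{PREFIX}_write",
         "collections": kept[0]},
        # Point the read alias at every collection still retained.
        {"action": "CREATEALIAS", "name": f"{PREFIX}_read",
         "collections": ",".join(kept)},
        # Drop the collection that has just aged out.
        {"action": "DELETE", "name": expired},
    ]
    return [f"{BASE}?{urlencode(call, safe=',')}" for call in calls]


for url in daily_plan(date(2016, 10, 6), retention_days=3, num_shards=2):
    print(url)
```

Dropping a whole collection avoids the per-document deletes that make TTL expiry costly for large log data.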





[jira] [Commented] (SOLR-9562) Minimize queried collections for time series alias

2016-10-05 Thread Eungsop Yoo (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550996#comment-15550996
 ] 

Eungsop Yoo commented on SOLR-9562:
---

I see. 

I found and read some articles related to this issue.
http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud?focusedCommentId=61317676#comment-61317676

With manual sharding, the client has to do shard-related work for both 
indexing and querying. But it seems this work can be moved from the client to 
the SolrCloud server. So we could build a new time series router that handles 
the sharding work for time series data. What do you think about this approach?
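
To make the router idea concrete, here is a rough sketch of what such a component would have to compute. `TimeSeriesRouter` and its methods are hypothetical names. Nothing here is an existing Solr router. It assumes fixed-length time buckets with one shard per bucket.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class TimeSeriesRouter:
    """Hypothetical router: one shard per fixed-length time bucket."""

    def __init__(self, time_field, bucket=timedelta(days=1), prefix="shard"):
        self.time_field = time_field
        self.bucket = bucket
        self.prefix = prefix

    def bucket_start(self, timestamp):
        """Floor a timestamp to the start of its bucket."""
        return EPOCH + ((timestamp - EPOCH) // self.bucket) * self.bucket

    def shard_name(self, start):
        return f"{self.prefix}_{start:%Y%m%d%H}"

    def shard_for_doc(self, doc):
        """Indexing side: pick the single shard that owns the document."""
        return self.shard_name(self.bucket_start(doc[self.time_field]))

    def shards_for_range(self, from_time, to_time):
        """Query side: list only the shards that overlap the time range."""
        shards = []
        start = self.bucket_start(from_time)
        while start <= to_time:
            shards.append(self.shard_name(start))
            start += self.bucket
        return shards


router = TimeSeriesRouter("timestamp", bucket=timedelta(hours=1))
doc = {"id": "1", "timestamp": datetime(2016, 10, 5, 13, 45, tzinfo=timezone.utc)}
print(router.shard_for_doc(doc))
```

With both methods on the server, a client would neither choose a shard when indexing nor list shards when querying.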




[jira] [Updated] (SOLR-9562) Minimize queried collections for time series alias

2016-10-03 Thread Eungsop Yoo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eungsop Yoo updated SOLR-9562:
--
Description: 
When indexing time series data (such as large volumes of log data), we can 
create a new collection at a regular interval (hourly, daily, etc.) behind a 
write alias, and create a read alias that covers all of those collections. 
However, every collection in the read alias is queried even when the search 
covers a very narrow time window. In that case the matching docs may be stored 
in only a small fraction of the collections, so there is no need to query the 
rest.

I suggest this patch, which lets a read alias minimize the collections it 
queries. It adds three parameters to the CREATEALIAS action.

|| Key || Type || Required || Default || Description ||
| timeField | string | No | | The name of the time field for the time series data. It should be a date type. |
| dateTimeFormat | string | No | | The timestamp format used when creating collections. Every collection name should have a suffix (starting with "_") in this format. Ex. dateTimeFormat: MMdd, collectionName: col_20160927. See [DateTimeFormatter|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]. |
| timeZone | string | No | | The time zone for the dateTimeFormat parameter. Ex. GMT+9. See [DateTimeFormatter|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]. |


Then, when we query with a filter query such as "timeField:\[fromTime TO 
toTime\]", only the collections that hold docs for the given time range are 
queried.

  was:
When indexing time series data (such as large volumes of log data), we can 
create a new collection at a regular interval (hourly, daily, etc.) behind a 
write alias, and create a read alias that covers all of those collections. 
However, every collection in the read alias is queried even when the search 
covers a very narrow time window. In that case the matching docs may be stored 
in only a small fraction of the collections, so there is no need to query the 
rest.

I suggest this patch, which lets a read alias minimize the collections it 
queries. It adds three parameters to the CREATEALIAS action.

|| Key || Type || Required || Default || Description ||
| timeField | string | No | | The name of the time field for the time series data. It should be a date type. |
| dateTimeFormat | string | No | | The timestamp format used when creating collections. Every collection name should have a suffix (starting with "_") in this format. Ex. dateTimeFormat: MMdd, collectionName: col_20160927. See [SimpleDateFormat|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]. |
| timeZone | string | No | | The time zone for the dateTimeFormat parameter. Ex. GMT+9. See [SimpleDateFormat|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]. |


Then, when we query with a filter query such as "timeField:\[fromTime TO 
toTime\]", only the collections that hold docs for the given time range are 
queried.




[jira] [Commented] (SOLR-9562) Minimize queried collections for time series alias

2016-10-03 Thread Eungsop Yoo (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15544140#comment-15544140
 ] 

Eungsop Yoo commented on SOLR-9562:
---

I backported this patch to my own cluster, Solr 4.10.3-cdh5.4.9.
Without the patch, a query over the last 30 minutes against 14 days of 
collections took over 20 seconds; it now takes only 3 seconds.




[jira] [Updated] (SOLR-9562) Minimize queried collections for time series alias

2016-10-03 Thread Eungsop Yoo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eungsop Yoo updated SOLR-9562:
--
Attachment: SOLR-9562-v2.patch

Some bugs are fixed.
SimpleDateFormat is replaced with DateTimeFormatter.




[jira] [Commented] (SOLR-9562) Minimize queried collections for time series alias

2016-09-26 Thread Eungsop Yoo (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15525038#comment-15525038
 ] 

Eungsop Yoo commented on SOLR-9562:
---

{quote}
Thanks for contributing. I'm missing something... why is this metadata on a 
Collection Alias? What do Collection Aliases logically have to do with this 
feature? Wouldn't associating with the Shard be better, assuming a design in 
which there is one Collection & manual sharding?
{quote}
I run a SolrCloud cluster that indexes log data, about 10 billion docs a day, 
and keeps the logs for 10 days. So I create a new collection for each day and 
delete the oldest collection every day. Read and write aliases are created 
over those collections. I use 
[Banana|https://github.com/lucidworks/banana] to query SolrCloud through the 
read alias. I think a read alias is the most transparent way to roll 
collections for a Solr client such as Banana.
That is why the metadata is added to the Alias.

{quote}
BTW I consider SimpleDateFormat and friends a dead API with the advent of Java 
8's new time API: 
https://docs.oracle.com/javase/8/docs/api/java/time/package-summary.html
{quote}
I see.




[jira] [Updated] (SOLR-9562) Minimize queried collections for time series alias

2016-09-26 Thread Eungsop Yoo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eungsop Yoo updated SOLR-9562:
--
Attachment: SOLR-9562.patch




[jira] [Created] (SOLR-9562) Minimize queried collections for time series alias

2016-09-26 Thread Eungsop Yoo (JIRA)
Eungsop Yoo created SOLR-9562:
-

 Summary: Minimize queried collections for time series alias
 Key: SOLR-9562
 URL: https://issues.apache.org/jira/browse/SOLR-9562
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Eungsop Yoo
Priority: Minor
 Attachments: SOLR-9562.patch

When indexing time series data (such as large volumes of log data), we can 
create a new collection at a regular interval (hourly, daily, etc.) behind a 
write alias, and create a read alias that covers all of those collections. 
However, every collection in the read alias is queried even when the search 
covers a very narrow time window. In that case the matching docs may be stored 
in only a small fraction of the collections, so there is no need to query the 
rest.

I suggest this patch, which lets a read alias minimize the collections it 
queries. It adds three parameters to the CREATEALIAS action.

|| Key || Type || Required || Default || Description ||
| timeField | string | No | | The name of the time field for the time series data. It should be a date type. |
| dateTimeFormat | string | No | | The timestamp format used when creating collections. Every collection name should have a suffix (starting with "_") in this format. Ex. dateTimeFormat: MMdd, collectionName: col_20160927. See [SimpleDateFormat|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]. |
| timeZone | string | No | | The time zone for the dateTimeFormat parameter. Ex. GMT+9. See [SimpleDateFormat|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]. |


Then, when we query with a filter query such as "timeField:\[fromTime TO 
toTime\]", only the collections that hold docs for the given time range are 
queried.
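
For illustration, a request using the three proposed parameters could be assembled as in the sketch below. CREATEALIAS with `name` and `collections` is the existing Collections API call. `timeField`, `dateTimeFormat` and `timeZone` exist only in this patch. The host, the alias name, the field name "timestamp" and the pattern value "yyyyMMdd" are assumptions chosen to match the eight-digit suffix in the example.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"  # placeholder host, assumed


def create_time_series_alias_url(alias, collections, time_field,
                                 date_time_format, time_zone):
    """Build a CREATEALIAS URL carrying the parameters proposed in the patch."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
        "timeField": time_field,             # proposed parameter
        "dateTimeFormat": date_time_format,  # proposed parameter
        "timeZone": time_zone,               # proposed parameter
    }
    return f"{BASE}/admin/collections?{urlencode(params, safe=',')}"


def range_filter(time_field, from_time, to_time):
    """Filter query of the form timeField:[fromTime TO toTime]."""
    return f"{time_field}:[{from_time} TO {to_time}]"


url = create_time_series_alias_url(
    "logs_read", ["col_20160926", "col_20160927"],
    "timestamp", "yyyyMMdd", "GMT+9")
print(url)
print(range_filter("timestamp", "NOW-30MINUTES", "NOW"))
```

The "+" in the time zone must be percent-encoded in the URL, which `urlencode` does here.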






[jira] [Updated] (SOLR-9236) AutoAddReplicas feature with one replica loses some documents not committed during failover

2016-07-07 Thread Eungsop Yoo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eungsop Yoo updated SOLR-9236:
--
Description: 
I need to index a huge amount of logs, so I decided to use the AutoAddReplicas 
feature with only one replica.
Using AutoAddReplicas with one replica is expected to bring some benefits.
- no redundant data files for replicas
-- saving disk usage
- best indexing performance 

I expected Solr to fail over just like HBase.
The feature worked almost as expected, except that some documents went missing 
during failover.
I found two reasons for the missing documents.

1. The leader replica does not replay any transaction logs. But when there is 
only one replica, that replica has to be the leader.
So I made the leader replica replay the transaction logs when AutoAddReplicas 
is used with one replica.

But the above fix alone did not resolve the problem.

2. As failovers occurred, the transaction log directory became more deeply 
nested, like this: tlog/tlog/tlog/...
The transaction log could not be replayed because the transaction log 
directory changed during failover. 
So I made the transaction log directory stay unchanged during failover.

After these fixes, AutoAddReplicas with one replica fails over well without 
losing any documents.
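
The nesting symptom in item 2 can be reproduced with a toy model. This is not Solr's code and does not claim to show the actual cause of the bug. It assumes, purely for illustration, that each failover derives the new transaction log directory by appending "tlog" to the directory the previous core used. The paths and function names are hypothetical.

```python
import posixpath

TLOG = "tlog"


def naive_tlog_dir(previous_dir):
    """Always append "tlog", even if the path already ends with it."""
    return posixpath.join(previous_dir, TLOG)


def stable_tlog_dir(previous_dir):
    """Append "tlog" only once, so repeated failovers keep the same path."""
    trimmed = previous_dir.rstrip("/")
    if posixpath.basename(trimmed) == TLOG:
        return trimmed
    return posixpath.join(trimmed, TLOG)


def after_failovers(resolve, data_dir, count):
    """Apply a directory-resolution rule once per simulated failover."""
    current = data_dir
    for _ in range(count):
        current = resolve(current)
    return current


data_dir = "/solr/col/core_node1/data"  # hypothetical path
print(after_failovers(naive_tlog_dir, data_dir, 3))
# prints /solr/col/core_node1/data/tlog/tlog/tlog
print(after_failovers(stable_tlog_dir, data_dir, 3))
# prints /solr/col/core_node1/data/tlog
```

Under the naive rule the logs written before a failover sit one level above where the new core looks, so they are never replayed.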


  was:
I need to index huge amount of logs, so I decide to use AutoAddReplica feature 
with only one replica.
When using AutoAddReplicas with one replica, some benefits are expected.
- no redundant data files for replicas
-- saving disk usage
- best indexing performance 

I expected that Solr fails over just like HBase.
The feature worked almost as it was expected, except for some missing documents 
during failover.
I found two regions for the missing.

1. The leader replica does not replay any transaction logs. But when there is 
only one replica, it should be the leader.
So I made the leader replica replay the transaction logs when using 
AutoAddReplicas with on replica.

But the above fix did not resolve the problem.

2. As failover occurred, the transaction log directory had a deeper directory 
depth. Just like this, tlog/tlog/tlog/...
The transaction log could not be replayed, because the transaction log 
directory was changed during failover. 
So I made the transaction log directory not changed during failover.

After these fixes, AutoAddReplicas with one replica fails over well without 
losing any documents.



> AutoAddReplicas feature with one replica loses some documents not committed 
> during failover
> ---
>
> Key: SOLR-9236
> URL: https://issues.apache.org/jira/browse/SOLR-9236
> Project: Solr
>  Issue Type: Bug
>  Components: hdfs, SolrCloud
>Reporter: Eungsop Yoo
>Assignee: Mark Miller
>Priority: Minor
> Attachments: SOLR-9236.patch, SOLR-9236.patch



[jira] [Commented] (SOLR-9236) AutoAddReplicas feature with one replica loses some documents not committed during failover

2016-06-30 Thread Eungsop Yoo (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358090#comment-15358090
 ] 

Eungsop Yoo commented on SOLR-9236:
---

LGTM

> AutoAddReplicas feature with one replica loses some documents not committed 
> during failover
> ---
>
> Key: SOLR-9236
> URL: https://issues.apache.org/jira/browse/SOLR-9236
> Project: Solr
>  Issue Type: Bug
>  Components: hdfs, SolrCloud
>Reporter: Eungsop Yoo
>Assignee: Mark Miller
>Priority: Minor
> Attachments: SOLR-9236.patch, SOLR-9236.patch
>
>
> I need to index a huge amount of logs, so I decided to use the 
> AutoAddReplicas feature with only one replica.
> When using AutoAddReplicas with one replica, some benefits are expected.
> - no redundant data files for replicas
> -- saving disk usage
> - best indexing performance 
> I expected Solr to fail over just as HBase does.
> The feature worked almost as expected, except that some documents went 
> missing during failover.
> I found two reasons for the missing documents.
> 1. The leader replica does not replay any transaction logs. But when there is 
> only one replica, that replica is always the leader, so nothing is replayed.
> So I made the leader replica replay the transaction logs when using 
> AutoAddReplicas with one replica.
> But the above fix alone did not resolve the problem.
> 2. As failover occurred, the transaction log directory became more deeply 
> nested, like this: tlog/tlog/tlog/...
> The transaction log could not be replayed because its directory had changed 
> during failover.
> So I made the transaction log directory stay unchanged during failover.
> After these fixes, AutoAddReplicas with one replica fails over well without 
> losing any documents.






[jira] [Updated] (SOLR-9236) AutoAddReplicas feature with one replica loses some documents not committed during failover

2016-06-21 Thread Eungsop Yoo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eungsop Yoo updated SOLR-9236:
--
Attachment: SOLR-9236.patch

> AutoAddReplicas feature with one replica loses some documents not committed 
> during failover
> ---
>
> Key: SOLR-9236
> URL: https://issues.apache.org/jira/browse/SOLR-9236
> Project: Solr
>  Issue Type: Bug
>  Components: hdfs, SolrCloud
>Reporter: Eungsop Yoo
>Priority: Minor
> Attachments: SOLR-9236.patch
>
>
> I need to index a huge amount of logs, so I decided to use the 
> AutoAddReplicas feature with only one replica.
> When using AutoAddReplicas with one replica, some benefits are expected.
> - no redundant data files for replicas
> -- saving disk usage
> - best indexing performance 
> I expected Solr to fail over just as HBase does.
> The feature worked almost as expected, except that some documents went 
> missing during failover.
> I found two reasons for the missing documents.
> 1. The leader replica does not replay any transaction logs. But when there is 
> only one replica, that replica is always the leader, so nothing is replayed.
> So I made the leader replica replay the transaction logs when using 
> AutoAddReplicas with one replica.
> But the above fix alone did not resolve the problem.
> 2. As failover occurred, the transaction log directory became more deeply 
> nested, like this: tlog/tlog/tlog/...
> The transaction log could not be replayed because its directory had changed 
> during failover.
> So I made the transaction log directory stay unchanged during failover.
> After these fixes, AutoAddReplicas with one replica fails over well without 
> losing any documents.






[jira] [Created] (SOLR-9236) AutoAddReplicas feature with one replica loses some documents not committed during failover

2016-06-21 Thread Eungsop Yoo (JIRA)
Eungsop Yoo created SOLR-9236:
-

 Summary: AutoAddReplicas feature with one replica loses some 
documents not committed during failover
 Key: SOLR-9236
 URL: https://issues.apache.org/jira/browse/SOLR-9236
 Project: Solr
  Issue Type: Bug
  Components: hdfs, SolrCloud
Reporter: Eungsop Yoo
Priority: Minor


I need to index a huge amount of logs, so I decided to use the AutoAddReplicas 
feature with only one replica.
When using AutoAddReplicas with one replica, some benefits are expected.
- no redundant data files for replicas
-- saving disk usage
- best indexing performance 

I expected Solr to fail over just as HBase does.
The feature worked almost as expected, except that some documents went missing 
during failover.
I found two reasons for the missing documents.

1. The leader replica does not replay any transaction logs. But when there is 
only one replica, that replica is always the leader, so nothing is replayed.
So I made the leader replica replay the transaction logs when using 
AutoAddReplicas with one replica.

But the above fix alone did not resolve the problem.

2. As failover occurred, the transaction log directory became more deeply 
nested, like this: tlog/tlog/tlog/...
The transaction log could not be replayed because its directory had changed 
during failover.
So I made the transaction log directory stay unchanged during failover.

After these fixes, AutoAddReplicas with one replica fails over well without 
losing any documents.



