[jira] [Created] (GOBBLIN-865) Add feature that enables PK-chunking in partition

2019-08-28 Thread Alex Li (Jira)
Alex Li created GOBBLIN-865:
---

 Summary: Add feature that enables PK-chunking in partition 
 Key: GOBBLIN-865
 URL: https://issues.apache.org/jira/browse/GOBBLIN-865
 Project: Apache Gobblin
  Issue Type: Task
Reporter: Alex Li


In the SFDC (Salesforce) connector, we have partitioning mechanisms that split a giant 
query into multiple subqueries. There are three mechanisms:
 * simple partition (equally split by time)
 * dynamic pre-partition (generate histogram and split by row numbers)
 * user specified partition (set up time range in job file)

However, tables such as Task and Contract fail from time to time to 
fetch their full data.

We may want to utilize PK-chunking to partition the query.

 

The PK-chunking documentation from SFDC: 
[https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/async_api_headers_enable_pk_chunking.htm]
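As a rough illustration of what enabling PK-chunking involves: per the SFDC doc linked above, chunking is requested through the Sforce-Enable-PKChunking header on the Bulk API job, with an optional chunkSize value. The sketch below only builds that header. The PkChunkingHeader class and its method are hypothetical names, not existing Gobblin code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Gobblin: builds the request header that
// asks the Salesforce Bulk API to split a query job by primary key.
final class PkChunkingHeader {
  static final String NAME = "Sforce-Enable-PKChunking";

  private PkChunkingHeader() {}

  // chunkSize is the number of records per chunk; it must be positive.
  static Map<String, String> withChunkSize(int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive: " + chunkSize);
    }
    Map<String, String> headers = new LinkedHashMap<>();
    headers.put(NAME, "chunkSize=" + chunkSize);
    return headers;
  }
}
```

The resulting map would then be attached to the Bulk API job request; how that is wired into the connector is left open here.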



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (GOBBLIN-995) Add function to instantiate the BulkConnection in SFDC connector

2019-12-05 Thread Alex Li (Jira)
Alex Li created GOBBLIN-995:
---

 Summary: Add function to instantiate the BulkConnection in SFDC 
connector
 Key: GOBBLIN-995
 URL: https://issues.apache.org/jira/browse/GOBBLIN-995
 Project: Apache Gobblin
  Issue Type: New Feature
Reporter: Alex Li


In the SalesforceExtractor class, we instantiate BulkConnection directly.
{code:java}
  this.bulkConnection = new BulkConnection(config);
{code}
This code makes it impossible to inject a customized BulkConnection.
In contrast, httpClient is instantiated in a function, so we can extend the 
class and override the function to return a customized httpClient 
(GaapHttPClient in our case).

We should add a function to instantiate the BulkConnection.
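A minimal sketch of the proposed shape, assuming a protected factory method that subclasses can override. ConnectorConfig and BulkConnection are stand-in classes here so the example compiles on its own, and the method name getBulkConnection and both extractor class names are hypothetical.

```java
// Stand-in types so the sketch compiles on its own; in the connector these
// would be the real ConnectorConfig and BulkConnection classes.
class ConnectorConfig {}

class BulkConnection {
  final ConnectorConfig config;

  BulkConnection(ConnectorConfig config) {
    this.config = config;
  }

  String describe() {
    return "default";
  }
}

class SalesforceExtractorSketch {
  private BulkConnection bulkConnection;

  // Hypothetical factory method: subclasses override it to inject their own connection.
  protected BulkConnection getBulkConnection(ConnectorConfig config) {
    return new BulkConnection(config);
  }

  void connect(ConnectorConfig config) {
    // The only change from the original code: go through the factory method.
    this.bulkConnection = getBulkConnection(config);
  }

  String connectionKind() {
    return bulkConnection.describe();
  }
}

// A subclass that swaps in a customized connection without touching connect().
class CustomExtractorSketch extends SalesforceExtractorSketch {
  @Override
  protected BulkConnection getBulkConnection(ConnectorConfig config) {
    return new BulkConnection(config) {
      @Override
      String describe() {
        return "custom";
      }
    };
  }
}
```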





[jira] [Created] (GOBBLIN-1025) Add retry for PK-Chunking iterator

2020-01-14 Thread Alex Li (Jira)
Alex Li created GOBBLIN-1025:


 Summary: Add retry for PK-Chunking iterator
 Key: GOBBLIN-1025
 URL: https://issues.apache.org/jira/browse/GOBBLIN-1025
 Project: Apache Gobblin
  Issue Type: Improvement
Reporter: Alex Li


In the SFDC connector, there is a class called `ResultIterator` (I will rename 
it to SalesforceRecordIterator).
It is currently used only by PK-Chunking. It encapsulates fetching a list of 
result files behind a record iterator.

However, csvReader.nextRecord() may throw a network IO exception. We 
should retry in this case.

When a network IO exception happens after a result file has been partly 
fetched, we are in a special situation: the first part of the file has already 
been fetched locally, but the rest of the file is still on the data source. 
We need to:
1. reopen the file stream
2. skip all the records that we have already fetched, moving the cursor to the 
first record that we have not fetched yet.
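
The two steps above can be sketched as follows. This is an illustration under assumptions, not the connector's code: RecordSource and ResumableIterator are hypothetical names, the source is assumed to return records in the same order every time it is reopened, and IO failures are assumed to surface as UncheckedIOException.

```java
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical source of records that can be reopened from the beginning.
interface RecordSource<T> {
  // Opens a fresh stream over the whole file; reading may throw UncheckedIOException.
  Iterator<T> open();
}

// Iterator that survives IO failures by reopening the source and skipping
// the records it has already handed out.
final class ResumableIterator<T> implements Iterator<T> {
  private final RecordSource<T> source;
  private final int retryLimit;
  private Iterator<T> current;   // null means "must (re)open before reading"
  private long delivered = 0;    // records already returned to the caller
  private T buffered;
  private boolean hasBuffered = false;
  private boolean done = false;

  ResumableIterator(RecordSource<T> source, int retryLimit) {
    if (retryLimit < 0) {
      throw new IllegalArgumentException("retryLimit must be >= 0");
    }
    this.source = source;
    this.retryLimit = retryLimit;
  }

  // Step 1 and step 2: reopen the stream, then skip what was already delivered.
  private Iterator<T> reopenAndSkip() {
    Iterator<T> it = source.open();
    for (long i = 0; i < delivered && it.hasNext(); i++) {
      it.next();
    }
    return it;
  }

  // Reads one record into the buffer, retrying up to retryLimit times on IO failure.
  private void advance() {
    UncheckedIOException last = null;
    for (int attempt = 0; attempt <= retryLimit; attempt++) {
      try {
        if (current == null) {
          current = reopenAndSkip();
        }
        if (current.hasNext()) {
          buffered = current.next();
          hasBuffered = true;
        } else {
          done = true;
        }
        return;
      } catch (UncheckedIOException e) {
        last = e;
        current = null; // force a reopen on the next attempt
      }
    }
    throw last;
  }

  @Override
  public boolean hasNext() {
    if (!hasBuffered && !done) {
      advance();
    }
    return hasBuffered;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    T result = buffered;
    buffered = null;
    hasBuffered = false;
    delivered++;
    return result;
  }
}
```

Counting delivered records rather than bytes keeps the skip logic independent of how the file is encoded, at the cost of re-reading the already fetched part after each failure.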





[jira] [Created] (GOBBLIN-1070) gobblin json converter not able to take in union with more than 2 types

2020-03-04 Thread Alex Li (Jira)
Alex Li created GOBBLIN-1070:


 Summary: gobblin json converter not able to take in union with 
more than 2 types
 Key: GOBBLIN-1070
 URL: https://issues.apache.org/jira/browse/GOBBLIN-1070
 Project: Apache Gobblin
  Issue Type: Improvement
Reporter: Alex Li


gobblin-core has a converter that is hard-coded to take only 2 types if the 
field type is union. The ideal behavior is to support an arbitrary number of 
types.

This is blocking Zendesk's tickets dataset from being ingested, as it has a 
column with 4+ types.
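
One possible shape for the ideal behavior, shown as a standalone sketch rather than the gobblin-core converter: keep a list of member converters of any length and try them in declared order. UnionConverterSketch and its helper methods are hypothetical names, and the first-match-wins rule is an assumption.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: a union converter that holds any number of member
// converters and returns the result of the first one that accepts the value.
final class UnionConverterSketch {
  private final List<Function<Object, Optional<Object>>> members;

  UnionConverterSketch(List<Function<Object, Optional<Object>>> members) {
    if (members.isEmpty()) {
      throw new IllegalArgumentException("a union needs at least one member type");
    }
    this.members = members;
  }

  Object convert(Object value) {
    for (Function<Object, Optional<Object>> member : members) {
      Optional<Object> converted = member.apply(value);
      if (converted.isPresent()) {
        return converted.get();
      }
    }
    throw new IllegalArgumentException("value matches no member of the union: " + value);
  }

  // Example member converters; each returns empty when the value is not its type.
  static Optional<Object> asLong(Object v) {
    if (v instanceof Number) {
      return Optional.<Object>of(((Number) v).longValue());
    }
    return Optional.empty();
  }

  static Optional<Object> asBoolean(Object v) {
    if (v instanceof Boolean) {
      return Optional.<Object>of(v);
    }
    return Optional.empty();
  }

  static Optional<Object> asString(Object v) {
    if (v instanceof String) {
      return Optional.<Object>of(v);
    }
    return Optional.empty();
  }
}
```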





[jira] [Created] (GOBBLIN-1097) ResultChainingIterator.add should check if the argument iterator is null

2020-03-24 Thread Alex Li (Jira)
Alex Li created GOBBLIN-1097:


 Summary: ResultChainingIterator.add should check if the argument 
iterator is null
 Key: GOBBLIN-1097
 URL: https://issues.apache.org/jira/browse/GOBBLIN-1097
 Project: Apache Gobblin
  Issue Type: Bug
Reporter: Alex Li


ResultChainingIterator.add should check if the argument iterator is null.

It currently fails if the argument is null.
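
A minimal sketch of such a check, assuming that silently skipping a null iterator is the desired behavior (rejecting it with a clear error would be the other option). ChainingIteratorSketch is a hypothetical stand-in, not the actual ResultChainingIterator.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a chaining iterator: it walks through the added
// iterators one after another.
final class ChainingIteratorSketch<T> implements Iterator<T> {
  private final Deque<Iterator<T>> iterators = new ArrayDeque<>();

  // Ignores a null argument instead of failing (ArrayDeque rejects null elements).
  void add(Iterator<T> iterator) {
    if (iterator == null) {
      return;
    }
    iterators.addLast(iterator);
  }

  @Override
  public boolean hasNext() {
    // Drop exhausted iterators from the front of the chain.
    while (!iterators.isEmpty() && !iterators.peekFirst().hasNext()) {
      iterators.removeFirst();
    }
    return !iterators.isEmpty();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return iterators.peekFirst().next();
  }
}
```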





[jira] [Created] (GOBBLIN-1101) Enhance bulk api retry for ExceedQuota

2020-03-30 Thread Alex Li (Jira)
Alex Li created GOBBLIN-1101:


 Summary: Enhance bulk api retry for ExceedQuota
 Key: GOBBLIN-1101
 URL: https://issues.apache.org/jira/browse/GOBBLIN-1101
 Project: Apache Gobblin
  Issue Type: Bug
Reporter: Alex Li


1. ExceedQuota exception

Below is the SFDC documentation about ExceedQuota:
{code:java}
One of the limits customers frequently reach is the concurrent request limit. 
Once a synchronous Apex request runs longer than 5 seconds, it begins counting 
against this limit. Each organization is allowed 10 concurrent long-running 
requests. If the limit is reached, any new synchronous Apex request results in 
a runtime exception. This behavior occurs until the organization’s requests are 
below the limit.
{code}
If the ExceedQuota exception happens, we should let the thread sleep for 5 minutes 
and try again. There should not be a retryLimit for this exception.

2. Exception stack in log file

For example, if we set retryLimit to 10, retry 10 times, and still fail, we 
print the exception stack in the log file, and that stack contains all 10 
exceptions:

SSL Exception (root cause) > retry and get ExceedQuota > retry and get 
ExceedQuota > (and so on, many times)

We should skip all the retry exceptions and keep only the root cause exception.
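
Keeping only the root cause can be sketched as below, assuming each retry wraps the previous failure as its cause so that the original exception sits at the bottom of the chain. RetryDiagnostics is a hypothetical helper name.

```java
// Hypothetical helper: finds the innermost exception in a cause chain so that
// only it needs to be logged.
final class RetryDiagnostics {
  private RetryDiagnostics() {}

  // Walks getCause() until there is no further cause.
  static Throwable rootCause(Throwable t) {
    Throwable root = t;
    while (root.getCause() != null && root.getCause() != root) {
      root = root.getCause();
    }
    return root;
  }
}
```

The log call would then pass rootCause(e) instead of e, which drops the intermediate retry exceptions from the printed stack.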





[jira] [Created] (GOBBLIN-1179) Add a typed config to replace properties

2020-06-02 Thread Alex Li (Jira)
Alex Li created GOBBLIN-1179:


 Summary: Add a typed config to replace properties
 Key: GOBBLIN-1179
 URL: https://issues.apache.org/jira/browse/GOBBLIN-1179
 Project: Apache Gobblin
  Issue Type: Task
Reporter: Alex Li


Add a typed config to replace *_properties.get(“ini.file.userName”)_* with 
*config.userName*

The Gobblin config file is an ini file. Java loads the ini file into a Properties 
instance, and the code reads config values from those properties:
|workUnitState.getPropAsBoolean(BULK_API_USE_QUERY_ALL)|
|workUnitState.getPropAsInt(FETCH_RETRY_LIMIT_KEY, DEFAULT_FETCH_RETRY_LIMIT)|
|Math.max(MIN_SIZE, Math.min(MAX_SIZE, workUnitState.getPropAsInt(PARTITION_SIZE, DEFAULT_SIZE))); // partition size must be >= min and <= max, otherwise use default|

Problems
 # No consistent key naming model
 * Long dot-separated key strings are used, which makes typos easy. The config 
code is also verbose: we use *properties.getProp(key, default)*.
 * Keys collide if the same type is used in multiple places, e.g. kafka.brokers


 # No ownership management
 * In the gobblin connector package, we have multiple static constant classes: 
GobblinKeys, QueryBaseKeys, GaapKeys, and SalesforceConnectorKeys. We can even 
read config values directly with state.getProp(*“my.key”*) without creating any 
constant key.


 # No static validation
 * Required & default value
 * Type check
 * Date range
 * Enum


 # No dependency check
 * If users set *useGaap=true*, then *gaap.url* and *gaap.credential* must also 
be set. This needs to be verified at both runtime and compile 
time.
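
A minimal sketch of what a typed config could look like, covering only the runtime side of the checks listed above (required values, type check, dependency check). The class name SalesforceConfigSketch and the key fetch.retry.limit with its default of 5 are assumptions made for illustration; the other keys are taken from the description.

```java
import java.util.Properties;

// Hypothetical typed view over the job properties. All validation happens
// once, in the constructor, instead of at every getProp call site.
final class SalesforceConfigSketch {
  final String userName;
  final int fetchRetryLimit;
  final boolean useGaap;
  final String gaapUrl;        // required only when useGaap is true
  final String gaapCredential; // required only when useGaap is true

  SalesforceConfigSketch(Properties props) {
    this.userName = required(props, "ini.file.userName");
    this.fetchRetryLimit = intOrDefault(props, "fetch.retry.limit", 5);
    this.useGaap = Boolean.parseBoolean(props.getProperty("useGaap", "false"));
    this.gaapUrl = props.getProperty("gaap.url");
    this.gaapCredential = props.getProperty("gaap.credential");
    // Dependency check: useGaap=true needs both gaap.* values.
    if (useGaap && (gaapUrl == null || gaapCredential == null)) {
      throw new IllegalArgumentException(
          "useGaap=true requires gaap.url and gaap.credential");
    }
  }

  // Required check: fail fast with the offending key in the message.
  private static String required(Properties props, String key) {
    String value = props.getProperty(key);
    if (value == null || value.trim().isEmpty()) {
      throw new IllegalArgumentException("Missing required config key: " + key);
    }
    return value;
  }

  // Type check with a default value.
  private static int intOrDefault(Properties props, String key, int defaultValue) {
    String raw = props.getProperty(key);
    if (raw == null) {
      return defaultValue;
    }
    try {
      return Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(
          "Config key " + key + " must be an integer, got: " + raw, e);
    }
  }
}
```

Call sites would then read config.userName or config.fetchRetryLimit, so a mistyped field name fails at compile time rather than returning a silent default.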

 





[jira] [Created] (GOBBLIN-1186) Fix SFDC source.querybased.salesforce.is.soft.deletes.pull.disabled not available for simple mode

2020-06-09 Thread Alex Li (Jira)
Alex Li created GOBBLIN-1186:


 Summary: Fix SFDC 
source.querybased.salesforce.is.soft.deletes.pull.disabled not available for 
simple mode
 Key: GOBBLIN-1186
 URL: https://issues.apache.org/jira/browse/GOBBLIN-1186
 Project: Apache Gobblin
  Issue Type: Bug
Reporter: Alex Li


*Problem statement*
source.querybased.salesforce.is.soft.deletes.pull.disabled
does not work in simple mode; it works only in dynamic mode.
The reason is that we explicitly set the key-value pair only for dynamic mode:
[https://github.com/hanghangliu/gobblin/blob/9029a89b85ef373f78d603b14d6aaa75998f3356/gobblin-salesforce/src/main/java/org/apache/gobblin/salesforce/SalesforceSource.java#L327]
 
*Root cause*
The extract state is blank (please see the code linked below).
What we set in the job file is not visible in the extractor state.
[https://github.com/hashdoop/hashdoop-incubator-gobblin/blob/a871e5c5d6f539bcfbcc4e2850685c58dd72dd1a/gobblin-core/src/main/java/org/apache/gobblin/source/extractor/extract/QueryBasedSource.java#L234]
 
*Solution:*
Explicitly set {{soft.deletes.pull.disabled}} for simple mode, as we did 
for dynamic mode.





[jira] [Created] (GOBBLIN-1188) Fix log message for SFDC iterators

2020-06-10 Thread Alex Li (Jira)
Alex Li created GOBBLIN-1188:


 Summary: Fix log message for SFDC iterators
 Key: GOBBLIN-1188
 URL: https://issues.apache.org/jira/browse/GOBBLIN-1188
 Project: Apache Gobblin
  Issue Type: Bug
Reporter: Alex Li


We print the same message twice. 

Remove it from hasNext().





[jira] [Updated] (GOBBLIN-1188) Fix log message for SFDC iterators

2020-06-10 Thread Alex Li (Jira)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Li updated GOBBLIN-1188:
-
Priority: Minor  (was: Major)






[jira] [Created] (GOBBLIN-1202) Add retry for REST API call

2020-06-18 Thread Alex Li (Jira)
Alex Li created GOBBLIN-1202:


 Summary: Add retry for REST API call
 Key: GOBBLIN-1202
 URL: https://issues.apache.org/jira/browse/GOBBLIN-1202
 Project: Apache Gobblin
  Issue Type: Improvement
Reporter: Alex Li


SFDC objects have an index on their *SystemModstamp* column.

This index may be on disk when we execute:  
{code:java}
Select count(systemmodstamp) from table_name group by day_only(systemmodstamp)
{code}
If the index is on disk, it has to be loaded first, and the query may time out.

A retry would resolve it.
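
A minimal sketch of such a retry, assuming the timeout surfaces as an unchecked exception and that any failure is worth retrying. RetrySketch and callWithRetry are hypothetical names, not the connector's API. The call runs once and is then repeated up to retryLimit more times, so a limit of 0 still runs it once.

```java
import java.util.function.Supplier;

// Hypothetical retry helper for a REST call that may time out on the first try.
final class RetrySketch {
  private RetrySketch() {}

  // Runs the call once, then up to retryLimit more times if it throws.
  // Rethrows the last failure when every attempt has failed.
  static <T> T callWithRetry(Supplier<T> call, int retryLimit) {
    if (retryLimit < 0) {
      throw new IllegalArgumentException("retryLimit must be >= 0");
    }
    RuntimeException last = null;
    for (int attempt = 0; attempt <= retryLimit; attempt++) {
      try {
        return call.get();
      } catch (RuntimeException e) {
        last = e; // remember the failure and try again
      }
    }
    throw last;
  }
}
```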

 





[jira] [Updated] (GOBBLIN-1208) Fix - restApiRetryLimit cannot be set to 0

2020-06-29 Thread Alex Li (Jira)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Li updated GOBBLIN-1208:
-
Description: 
Fix - restApiRetryLimit cannot be set to 0

If we set restApiRetryLimit=0, the code should still execute once.

Change the code from:
{code:java}
  private JsonArray getRecordsForQuery(SalesforceConnector connector, String query) {
    RestApiProcessingException exception = null;
    for (int i = 0; i < workUnitConf.restApiRetryLimit; i++) {
{code}
to:
{code:java}
  private JsonArray getRecordsForQuery(SalesforceConnector connector, String query) {
    RestApiProcessingException exception = null;
    for (int i = 0; i < workUnitConf.restApiRetryLimit + 1; i++) {
{code}

  was:
Fix - restApiRetryLimit cannot be set to 0

if we set restApiRetryLimit=0, the code should be execute 1 time.

change code:
{code:java}
  private JsonArray getRecordsForQuery(SalesforceConnector connector, String 
query) {
RestApiProcessingException exception = null;
for (int i = 0; i < workUnitConf.restApiRetryLimit; i++) {
{code}
to 
{code:java}
private JsonArray getRecordsForQuery(SalesforceConnector connector, String 
query) { RestApiProcessingException exception = null; for (int i = 0; i < 
workUnitConf.restApiRetryLimit+1; i++) {
{code}







[jira] [Created] (GOBBLIN-1208) Fix - restApiRetryLimit cannot be set to 0

2020-06-29 Thread Alex Li (Jira)
Alex Li created GOBBLIN-1208:


 Summary: Fix - restApiRetryLimit cannot be set to 0
 Key: GOBBLIN-1208
 URL: https://issues.apache.org/jira/browse/GOBBLIN-1208
 Project: Apache Gobblin
  Issue Type: Improvement
Reporter: Alex Li


Fix - restApiRetryLimit cannot be set to 0

If we set restApiRetryLimit=0, the code should still execute once.

Change the code from:
{code:java}
  private JsonArray getRecordsForQuery(SalesforceConnector connector, String query) {
    RestApiProcessingException exception = null;
    for (int i = 0; i < workUnitConf.restApiRetryLimit; i++) {
{code}
to:
{code:java}
  private JsonArray getRecordsForQuery(SalesforceConnector connector, String query) {
    RestApiProcessingException exception = null;
    for (int i = 0; i < workUnitConf.restApiRetryLimit + 1; i++) {
{code}


