Re: sampling function

2016-07-09 Thread Greg Hogan
Hi Do,

DataSet provides a stable @Public interface. DataSetUtils is marked
@PublicEvolving which is intended for public use, has stable behavior, but
method signatures may change. It's also good to limit DataSet to common
methods whereas the utility methods tend to be used for specific
applications.

I don't have the pulse of streaming but this sounds like a useful feature
that could be added.

Greg

On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do  wrote:

> Hi all,
>
> I'm working on approximate computing using sampling techniques. I
> recognized that Flink supports the sample function for Dataset
> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
> you didn't merge the function to org/apache/flink/api/java/DataSet.java
> since the sample function works as a transformation operator?
>
> The second question is that are you planning to support the sample
> function for DataStream (within windows) since I did not see it in
> DataStream code ?
>
> Thank you,
> Do
>


sampling function

2016-07-09 Thread Le Quoc Do
Hi all,

I'm working on approximate computing using sampling techniques. I
recognized that Flink supports the sample function for Dataset
(org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
you didn't merge the function to org/apache/flink/api/java/DataSet.java
since the sample function works as a transformation operator?

The second question is that are you planning to support the sample function
for DataStream (within windows) since I did not see it in DataStream code ?

Thank you,
Do


Re: Random access to small global state

2016-07-09 Thread Suneel Marthi
U could use ignite too, I believe they have a plugin for flink streaming.

Sent from my iPhone

> On Jul 9, 2016, at 8:05 AM, Sebastian  wrote:
> 
> Hi,
> 
> I'm planning to work on a streaming recommender in Flink, and one problem 
> that I have is that the algorithm needs random access to a small global state 
> (say a million counts). It should be ok if there is some inconsistency in the 
> state (e.g., delay in seeing updates).
> 
> Does anyone here have experience with such things? I'm thinking of connecting 
> Flink to a lighweight in-memory key-value store such as memcache for that.
> 
> Best,
> Sebastian


Random access to small global state

2016-07-09 Thread Sebastian

Hi,

I'm planning to work on a streaming recommender in Flink, and one 
problem that I have is that the algorithm needs random access to a small 
global state (say a million counts). It should be ok if there is some 
inconsistency in the state (e.g., delay in seeing updates).


Does anyone here have experience with such things? I'm thinking of 
connecting Flink to a lighweight in-memory key-value store such as 
memcache for that.


Best,
Sebastian


Re: Extract type information from SortedMap

2016-07-09 Thread Yukun Guo
Hi Robert,

On 9 July 2016 at 00:25, Robert Metzger  wrote:

> Hi Yukun,
>
> can you also post the code how you are invoking the GenericFlatMapper on
> the mailing list?
>

Here is the code defining the topology:

DataStream stream = ...;
stream
.keyBy(new KeySelector() {
@Override
public Integer getKey(String x) throws Exception {
return x.hashCode() % 10;
}
})
.timeWindow(Time.seconds(10))
.fold(new TreeMap(), new FoldFunction>() {
@Override
public SortedMap fold(SortedMap map,
String x) {
Long current = map.get(x);
Long updated = current != null ? current + 1 : 1;
map.put(x, updated);
return map;
}
})
.flatMap(new GenericFlatMapper())
.returns(new TypeHint>(){}.getTypeInfo())
 // throws InvalidTypesException if you comment out this line
.print();



>
> The Java compiler is usually dropping the generic types during compilation
> ("type erasure"), that's why we can not infer the types.
>
>
The error message implies type extraction should be possible when "all
variables in the return type can be deduced from the input type(s)". This
is true for flatMap(Tuple2, Collector>), but if
the signature is changed to void flatMap(SortedMap,
Collector>), type inference fails.


>
> On Fri, Jul 8, 2016 at 12:27 PM, Yukun Guo  wrote:
>
>> Hi,
>> When I run the code implementing a generic FlatMapFunction, Flink
>> complained about InvalidTypesException:
>>
>> public class GenericFlatMapper implements FlatMapFunction> Long>, Tuple2> {
>> @Override
>> public void flatMap(SortedMap m, Collector> 
>> out) throws Exception {
>> for (Map.Entry entry : m.entrySet()) {
>> out.collect(Tuple2.of(entry.getKey(), entry.getValue()));
>> }
>> }
>> }
>>
>>
>> *Exception in thread "main"
>> org.apache.flink.api.common.functions.InvalidTypesException: The return
>> type of function could not be determined automatically, due to type
>> erasure. You can give type information hints by using the returns(...)
>> method on the result of the transformation call, or by letting your
>> function implement the 'ResultTypeQueryable' interface.*
>>
>> *...*
>> *Caused by: org.apache.flink.api.common.functions.InvalidTypesException:
>> Type of TypeVariable 'T' in 'class GenericFlatMapper' could not be
>> determined. This is most likely a type erasure problem. The type extraction
>> currently supports types with generic variables only in cases where all
>> variables in the return type can be deduced from the input type(s).*
>>
>> This puzzles me as Flink should be able to infer the type from arguments.
>> I know returns(...) or other workarounds to give type hint, but they are
>> kind of verbose. Any suggestions?
>>
>>
>


Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-09 Thread Saliya Ekanayake
Hi,

The current start/stop scripts SSH worker nodes each time they appear in
the slaves file. When spawning multiple TMs (like 24 per node), this is
very inefficient.

I've changed the scripts to do one SSH per node and spawn a given N number
of TMs afterwards. I can make a pull request if this seems usable to
others. For now, I assume slaves file will indicate the number of TMs per
slave in "IP N" format.

Thank you,
Saliya

-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington