Re: Spark 1.6 Release window is not updated in Spark-wiki

2015-10-01 Thread Sean Owen
My guess is that the 1.6 merge window should close at the end of
November (2 months from now)? I can update it but wanted to check if
anyone else has a preferred tentative plan.

On Thu, Oct 1, 2015 at 2:20 AM, Meethu Mathew  wrote:
> Hi,
> In the https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage the
> current release window has not been changed from 1.5. Can anybody give an
> idea of the expected dates for 1.6 version?
>
> Regards,
>
> Meethu Mathew
> Senior Engineer
> Flytxt
>




Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should. For example, if the JSON 
in a small batch only contains integer values for what is really a String field, 
it will class the field as an Integer type on one streaming batch, then as a 
String on the next.
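
As a concrete (made-up) illustration, assuming the usual spark-shell sc and 
sqlContext and a hypothetical "id" field:

// Hypothetical two-batch example of the conflict: inference only sees the values
// present in each batch, so the same field gets two different types.
val batch1 = sc.parallelize(Seq("""{"id": 123}"""))      // only numeric values in this batch
val batch2 = sc.parallelize(Seq("""{"id": "abc123"}""")) // a genuinely string value

sqlContext.read.json(batch1).printSchema()  // id inferred as a numeric (long) type
sqlContext.read.json(batch2).printSchema()  // id inferred as a string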

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan


Re: Task Execution

2015-10-01 Thread Rishitesh Mishra
Depending on the number of cores configured for the executor, the scheduler
will assign that many tasks to it at a time. So yes, they execute in parallel.
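
As a rough sketch (local mode just for illustration; in a cluster the same bound
is executor cores divided by spark.task.cpus):

// Illustrative only: parallelism per executor = executor cores / spark.task.cpus.
// Here local[4] gives 4 cores, so up to 4 tasks run concurrently.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-sketch")
  .setMaster("local[4]")       // 4 cores -> up to 4 concurrent tasks
  .set("spark.task.cpus", "1") // cores each task reserves (the default)

val sc = new SparkContext(conf)
// A job with 100 partitions will run its tasks 4 at a time on this setup.
sc.parallelize(1 to 1000, 100).map(_ * 2).count()
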
On 30 Sep 2015 14:51, "gsvic"  wrote:

> Concerning task execution, a worker executes its assigned tasks in parallel
> or sequentially?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Task-Execution-tp14411.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark 1.6 Release window is not updated in Spark-wiki

2015-10-01 Thread Patrick Wendell
BTW - the merge window for 1.6 is September and October. The QA window is
November, and we expect to ship probably in early December. We are on a
3-month release cadence, with the caveat that there is some
pipelining... as we finish release X we are already starting on
release X+1.

- Patrick

On Thu, Oct 1, 2015 at 11:30 AM, Patrick Wendell  wrote:
> Ah - I can update it. Usually i do it after the release is cut. It's
> just a standard 3 month cadence.
>
> On Thu, Oct 1, 2015 at 3:55 AM, Sean Owen  wrote:
>> My guess is that the 1.6 merge window should close at the end of
>> November (2 months from now)? I can update it but wanted to check if
>> anyone else has a preferred tentative plan.
>>
>> On Thu, Oct 1, 2015 at 2:20 AM, Meethu Mathew  
>> wrote:
>>> Hi,
>>> In the https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage the
>>> current release window has not been changed from 1.5. Can anybody give an
>>> idea of the expected dates for 1.6 version?
>>>
>>> Regards,
>>>
>>> Meethu Mathew
>>> Senior Engineer
>>> Flytxt
>>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>




Re: Spark 1.6 Release window is not updated in Spark-wiki

2015-10-01 Thread Patrick Wendell
Ah - I can update it. Usually I do it after the release is cut. It's
just a standard 3-month cadence.

On Thu, Oct 1, 2015 at 3:55 AM, Sean Owen  wrote:
> My guess is that the 1.6 merge window should close at the end of
> November (2 months from now)? I can update it but wanted to check if
> anyone else has a preferred tentative plan.
>
> On Thu, Oct 1, 2015 at 2:20 AM, Meethu Mathew  
> wrote:
>> Hi,
>> In the https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage the
>> current release window has not been changed from 1.5. Can anybody give an
>> idea of the expected dates for 1.6 version?
>>
>> Regards,
>>
>> Meethu Mathew
>> Senior Engineer
>> Flytxt
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>




Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Reynold Xin
You can pass the schema into json directly, can't you?
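
Something like this, for example (a minimal sketch -- sqlContext, jsonRdd and the
field names here are assumed, not taken from your code):

// A minimal sketch of supplying the schema up front instead of inferring it per batch.
// jsonRdd is assumed to be an RDD[String] of JSON records; field names are hypothetical.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("payload", StructType(Seq(
    StructField("value", StringType)
  )))
))

val df = sqlContext.read.schema(schema).json(jsonRdd)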

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
wrote:

> Hi all,
>
>
>
> We really like the ability to infer a schema from JSON contained in an
> RDD, but when we’re using Spark Streaming on small batches of data, we
> sometimes find that Spark infers a more specific type than it should use,
> for example if the json in that small batch only contains integer values
> for a String field, it’ll class the field as an Integer type on one
> Streaming batch, then a String on the next one.
>
>
>
> Instead, we’d rather match every value as a String type, then handle any
> casting to a desired type later in the process.
>
>
>
> I don’t think there’s currently any simple way to avoid this that I can
> see, but we could add the functionality in the JacksonParser.scala file,
> probably in convertField.
>
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>
>
>
> Does anyone know an easier and cleaner way to do this?
>
>
>
> Thanks,
>
> Ewan
>


Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Yin Huai
Hi Ewan,

For your use case, you only need the schema inference to pick up the
structure of your data (basically you want Spark SQL to infer the type of
complex values like arrays and structs but keep the type of primitive
values as strings), right?
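
If so, one way to sketch that outside of Spark (illustrative only -- this is not
an existing option) is to infer the schema once, rewrite every primitive leaf
type to StringType while keeping the structs/arrays/maps, and re-read with that
schema:

// Illustrative sketch only, not an existing Spark API. jsonRdd is assumed to be an
// RDD[String]; sqlContext an existing SQLContext.
import org.apache.spark.sql.types._

def stringifyPrimitives(dt: DataType): DataType = dt match {
  case s: StructType => StructType(s.fields.map(f => f.copy(dataType = stringifyPrimitives(f.dataType))))
  case a: ArrayType  => a.copy(elementType = stringifyPrimitives(a.elementType))
  case m: MapType    => m.copy(keyType = stringifyPrimitives(m.keyType),
                               valueType = stringifyPrimitives(m.valueType))
  case _             => StringType
}

val inferred = sqlContext.read.json(jsonRdd).schema
val stringSchema = stringifyPrimitives(inferred).asInstanceOf[StructType]
val df = sqlContext.read.schema(stringSchema).json(jsonRdd)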

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
wrote:

> We could, but if a client sends some unexpected records in the schema
> (which happens more than I'd like, our schema seems to constantly evolve),
> its fantastic how Spark picks up on that data and includes it.
>
>
> Passing in a fixed schema loses that nice additional ability, though it's
> what we'll probably have to adopt if we can't come up with a way to keep
> the inference working.
>
>
> Thanks,
>
> Ewan
>
>
> -- Original message--
>
> *From: *Reynold Xin
>
> *Date: *Thu, 1 Oct 2015 22:12
>
> *To: *Ewan Leith;
>
> *Cc: *dev@spark.apache.org;
>
> *Subject:*Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
> You can pass the schema into json directly, can't you?
>
> On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
> wrote:
>
>> Hi all,
>>
>>
>>
>> We really like the ability to infer a schema from JSON contained in an
>> RDD, but when we’re using Spark Streaming on small batches of data, we
>> sometimes find that Spark infers a more specific type than it should use,
>> for example if the json in that small batch only contains integer values
>> for a String field, it’ll class the field as an Integer type on one
>> Streaming batch, then a String on the next one.
>>
>>
>>
>> Instead, we’d rather match every value as a String type, then handle any
>> casting to a desired type later in the process.
>>
>>
>>
>> I don’t think there’s currently any simple way to avoid this that I can
>> see, but we could add the functionality in the JacksonParser.scala file,
>> probably in convertField.
>>
>>
>>
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>>
>>
>>
>> Does anyone know an easier and cleaner way to do this?
>>
>>
>>
>> Thanks,
>>
>> Ewan
>>
>
>


Re: Tungsten off heap memory access for C++ libraries

2015-10-01 Thread Paul Wais
Update for those who are still interested: djinni is a nice tool for
generating Java/C++ bindings.  Before today djinni's Java support was only
aimed at Android, but now djinni works with (at least) Debian, Ubuntu, and
CentOS.

djinni will help you run C++ code in-process, with the caveat that it only
supports deep copies of on-JVM-heap data (and no special off-heap features
yet).  However, you can in theory use Unsafe to get pointers to off-heap
memory and pass those (as longs) to native code.
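
For instance, a rough Scala sketch of the Unsafe side (illustrative only; the
actual JNI hand-off is elided and the native call below is hypothetical):

// Allocate off-heap memory via sun.misc.Unsafe and obtain the raw address (a Long)
// that could be passed across JNI to C++. This is not Spark/Tungsten internals.
import sun.misc.Unsafe

val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

val size = 1024L
val address = unsafe.allocateMemory(size) // off-heap pointer, returned as a Long
try {
  unsafe.putLong(address, 42L)            // write through the raw address
  // nativeLib.process(address, size)     // hypothetical JNI call taking the pointer
} finally {
  unsafe.freeMemory(address)              // off-heap memory is never garbage collected
}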

So if you need a solution *today*,  try checking out a small demo:
https://github.com/dropbox/djinni/tree/master/example/localhost

For the long deets, see:
 https://github.com/dropbox/djinni/pull/140



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-tp13898p14427.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
We could, but if a client sends some records that don't match the expected schema 
(which happens more than I'd like; our schema seems to constantly evolve), it's 
fantastic how Spark picks up on that data and includes it.


Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.
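
As a hypothetical example of what inference buys us: a field the client just
started sending gets merged into the schema automatically, whereas a fixed
schema that omits it would silently drop it.

// Made-up records: the new "extra" field is picked up by inference without any change
// on our side; with a fixed schema it would simply be ignored.
val rdd = sc.parallelize(Seq(
  """{"id": "1"}""",
  """{"id": "2", "extra": "field the client just started sending"}"""
))

sqlContext.read.json(rdd).printSchema()
// root
//  |-- extra: string (nullable = true)
//  |-- id: string (nullable = true)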


Thanks,

Ewan


-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan



Spark 1.6 Release window is not updated in Spark-wiki

2015-10-01 Thread Meethu Mathew
Hi,
On the wiki homepage, https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage, the
current release window has not been changed from 1.5. Can anybody give an
idea of the expected dates for the 1.6 version?

Regards,

Meethu Mathew
Senior Engineer
Flytxt


Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
Exactly, that's a much better way to put it.


Thanks,

Ewan


-- Original message--

From: Yin Huai

Date: Thu, 1 Oct 2015 23:54

To: Ewan Leith;

Cc: r...@databricks.com;dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
> wrote:

We could, but if a client sends some unexpected records in the schema (which 
happens more than I'd like, our schema seems to constantly evolve), its 
fantastic how Spark picks up on that data and includes it.


Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.


Thanks,

Ewan


-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan




[ANNOUNCE] Announcing Spark 1.5.1

2015-10-01 Thread Reynold Xin
Hi All,

Spark 1.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-1.5 maintenance branch of Spark. We
*strongly recommend* that all 1.5.0 users upgrade to this release.

The full list of bug fixes is here: http://s.apache.org/spark-1.5.1

http://spark.apache.org/releases/spark-release-1-5-1.html


(note: it can take a few hours for everything to propagate, so you
might get a 404 on some download links, but everything should be in Maven
Central already)
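
For sbt users, picking up the fix release is just a version bump (a minimal
example using the coordinates published to Maven Central; adjust the Spark
modules and scope to match your build):

// Example sbt dependency bump to the 1.5.1 maintenance release.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"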