Re: Structured Streaming with Kafka sources/sinks

2016-08-15 Thread Cody Koeninger
https://issues.apache.org/jira/browse/SPARK-15406

I'm not working on it (yet?); I never got an answer to the question of
who was planning to work on it.

On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao  wrote:
> Hi all,
>
>
>
> I’m trying to write Structured Streaming test code and will deal with a Kafka
> source. Currently Spark 2.0 doesn’t support Kafka sources/sinks.
>
>
>
> I found some Databricks slides saying that Kafka sources/sinks will be
> implemented in Spark 2.0, so is there anybody working on this? And when will
> it be released?
>
>
>
> Thanks,
>
> Chenzhao Guo




Structured Streaming with Kafka sources/sinks

2016-08-15 Thread Guo, Chenzhao
Hi all,

I'm trying to write Structured Streaming test code and will deal with a Kafka 
source. Currently Spark 2.0 doesn't support Kafka sources/sinks.

I found some Databricks slides saying that Kafka sources/sinks will be 
implemented in Spark 2.0, so is there anybody working on this? And when will it 
be released?

Thanks,
Chenzhao Guo


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-15 Thread Jacek Laskowski
Thanks Sean. That reflects my sentiments so well!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Aug 15, 2016 at 1:08 AM, Sean Owen  wrote:
> I believe Chris was being a bit facetious.
>
> The ASF guidance is right: it's important that people don't consume
> non-blessed snapshot builds as they would other releases. The intended
> audience is developers and so the easiest default policy is to only
> advertise the snapshots where only developers are likely to be
> looking.
>
> That said, they're not secret or confidential, and while this probably
> should go to dev@, it's not a sin to mention the name of snapshots on
> user@, as long as these disclaimers are clear too. I'd rather a user
> understand the full picture, than find the snapshots and not
> understand any of the context.
>
> On Mon, Aug 15, 2016 at 2:11 AM, Jacek Laskowski  wrote:
>> Hi Chris,
>>
>> With my ASF member hat on...
>>
>> Oh, come on, Chris. It's not "in violation of ASF policies"
>> whatsoever. Policies are for ASF developers, not for users. Honestly, I
>> was surprised to read the note in Mark Hamstra's email. It's very
>> restrictive, but it's about what committers and PMCs should do, not what
>> users should do:
>>
>> "Do not include any links on the project website that might encourage
>> non-developers to download and use nightly builds, snapshots, release
>> candidates, or any other similar package."
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski




Number of tasks on executors becomes negative after executor failures

2016-08-15 Thread Rachana Srivastava
Summary:
I am running Spark 1.5 on CDH 5.5.1. Under extreme load, I intermittently get 
the connection failure exception below, and afterwards the task counts for some 
executors in the Spark UI become negative.

Exception:
TRACE: org.apache.hadoop.hbase.ipc.AbstractRpcClient - Call: Multi, callTime: 
76ms
INFO : org.apache.spark.network.client.TransportClientFactory - Found inactive 
connection to /xxx.xxx.xxx., creating a new one.
ERROR: org.apache.spark.network.shuffle.RetryingBlockFetcher - Exception while 
beginning fetch of 1 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to /xxx.xxx.xxx.
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /xxx.xxx.xxx.
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more


Related Defects:
https://issues.apache.org/jira/browse/SPARK-2319
https://issues.apache.org/jira/browse/SPARK-9591


[Inline image attachment: image001.png]


Re: How to resolve the SparkException: Size exceeds Integer.MAX_VALUE

2016-08-15 Thread Ewan Leith
I think this is more suited to the user mailing list than the dev one, but this 
error almost always means you need to repartition your data into smaller 
partitions, as one of the partitions is over 2 GB (a single block cannot exceed 
Integer.MAX_VALUE bytes).

When you create your dataset, put something like .repartition(1000) at the end 
of the command that creates the initial DataFrame or Dataset.
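
For example, a rough sketch, assuming Spark 2.0's SparkSession API and a made-up 
CSV input path (adjust the reader and the partition count to your data):

    // Illustrative only: read the training data and immediately repartition it
    // so that no single partition comes close to the 2 GB block limit.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kaggle-training").getOrCreate()

    val df = spark.read
      .option("header", "true")
      .csv("/path/to/train.csv")   // hypothetical input path
      .repartition(1000)           // pick a value that keeps partitions well under 2 GB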

Ewan

On 15 Aug 2016 17:46, Minudika Malshan  wrote:
Hi all,

I am trying to create and train a model for a Kaggle competition dataset using 
Apache Spark. The dataset has more than 10 million rows of data.
But when training the model, I get an exception "Size exceeds 
Integer.MAX_VALUE".

I found that the same question has been raised on Stack Overflow, but those 
answers didn't help much.

It would be great if you could help me resolve this issue.

Thanks.
Minudika




How to resolve the SparkException: Size exceeds Integer.MAX_VALUE

2016-08-15 Thread Minudika Malshan
Hi all,

I am trying to create and train a model for a Kaggle competition dataset
using Apache Spark. The dataset has more than 10 million rows of data.
But when training the model, I get an exception "*Size exceeds
Integer.MAX_VALUE*".

I found that the same question has been raised on Stack Overflow, but those
answers didn't help much.

It would be great if you could help me resolve this issue.

Thanks.
Minudika


Spark hangs after OOM in Serializer

2016-08-15 Thread mikhainin
Hi guys, 

I'm using Spark 1.6.2 and have run into a problem, so I kindly ask for your help.
Sometimes, when DAGScheduler tries to serialise the (RDD, closure) pair, an OOM
exception is thrown inside the closureSerializer.serialize() call (you can see a
stack trace below). The OOM itself isn't the problem; the problem is that Spark
hangs after it has happened.

I've fixed the problem by adding OOM handling to the try-catch statement inside
the submitMissingTasks() function, and now when this happens Spark correctly
finishes its work. But I noticed that non-fatal errors are already handled, with
Spark aborting the task when they occur, and there are a lot of places where
NonFatal(e) is matched. So it looks like other types of errors are
deliberately ignored.

And since I don't clearly understand the reason for that, I'm not sure
that the fix is correct. Could you please have a look and point me to a
better solution for the issue?
The fix: abort-task-on-oom-in-dag-scheduler.patch
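
To make the idea concrete, here is a simplified, hypothetical sketch of the 
pattern I mean (illustrative names only, not the actual DAGScheduler code and 
not the attached patch):

    import scala.util.control.NonFatal

    // Illustrative pattern: abort the stage when task-binary serialization hits
    // an OutOfMemoryError, instead of letting the error escape the event loop
    // and leave the scheduler hanging; non-fatal errors are already handled
    // this way.
    def serializeTaskBinary(serialize: () => Array[Byte],
                            abortStage: String => Unit): Option[Array[Byte]] = {
      try {
        Some(serialize())
      } catch {
        case oom: OutOfMemoryError =>
          abortStage(s"Task serialization failed with OutOfMemoryError: ${oom.getMessage}")
          None
        case NonFatal(e) =>
          abortStage(s"Task serialization failed: ${e.getMessage}")
          None
      }
    }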

  


at
org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1016)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
at
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:861)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1611)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1603)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1592)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:920)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at 
com.massivedatascience.util.SparkHelper$class.sync(SparkHelper.scala:39)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans.sync(ColumnTrackingKMeans.scala:255)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans$$anonfun$11.apply(ColumnTrackingKMeans.scala:457)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans$$anonfun$11.apply(ColumnTrackingKMeans.scala:456)
at
com.massivedatascience.util.SparkHelper$class.withBroadcast(SparkHelper.scala:83)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans.withBroadcast(ColumnTrackingKMeans.scala:255)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans.com$massivedatascience$clusterer$ColumnTrackingKMeans$$lloyds$1(ColumnTrackingKMeans.scala:456)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans$$anonfun$cluster$3$$anonfun$apply$5.apply(ColumnTrackingKMeans.scala:485)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans$$anonfun$cluster$3$$anonfun$apply$5.apply(ColumnTrackingKMeans.scala:480)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans$$anonfun$cluster$3.apply(ColumnTrackingKMeans.scala:480)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans$$anonfun$cluster$3.apply(ColumnTrackingKMeans.scala:479)
at
com.massivedatascience.util.SparkHelper$class.withCached(SparkHelper.scala:71)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans.withCached(ColumnTrackingKMeans.scala:255)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans.cluster(ColumnTrackingKMeans.scala:479)
at
com.massivedatascience.clusterer.MultiKMeansClusterer$class.best(MultiKMeansClusterer.scala:37)
at
com.massivedatascience.clusterer.ColumnTrackingKMeans.best(ColumnTrackingKMeans.scala:255)
at 
com.massivedatascience.clusterer.KMeans$.simpleTrain(KMeans.scala:168)
at

Re: Welcoming Felix Cheung as a committer

2016-08-15 Thread mayur bhole
Congrats Felix!

On Mon, Aug 15, 2016 at 2:57 PM, Paul Roy  wrote:

> Congrats Felix
>
> Paul Roy.
>
> On Mon, Aug 8, 2016 at 9:15 PM, Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The PMC recently voted to add Felix Cheung as a committer. Felix has been
>> a major contributor to SparkR and we're excited to have him join
>> officially. Congrats and welcome, Felix!
>>
>> Matei
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> "Change is slow and gradual. It requires hard work, a bit of
> luck, a fair amount of self-sacrifice and a lot of patience."
>
> Roy.
>


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-15 Thread Steve Loughran

As well as the legal issue 'nightly builds haven't been through the strict 
review and license check process for ASF releases' and the engineering issue 
'release off a nightly and your users will hate you', there's an ASF community 
one: ASF projects want to build a dev community as well as a user one, and 
encouraging users to jump to coding for/near a project is a way to do this.


1. Anyone on the user@ lists is strongly encouraged to get on the dev@ lists, not 
just to contribute code but to contribute to discourse on project direction, 
comment on issues, and review other code.

2. It's really good if someone builds/tests their downstream apps against 
pre-releases; catching problems early is the best way to fix them (see the sbt 
sketch at the end of this mail).

3. It really, really helps if the people doing builds/tests of downstream apps 
have their own copy of the source code, in sync with the snapshots they use. 
That puts them in a place to start debugging any problems which surface and to 
identify whether it is a bug in their own code or a regression in the 
dependency; if it is the latter, they are in a position to start working 
on a fix *and test it in the exact environment where the problem arises*.

That's why you want to restrict these snapshots to developers: it's not "go 
away, user", it's "come and join the developers".
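
For anyone who does want to do that kind of downstream testing, here's a minimal 
sbt sketch; the repository URL and the snapshot version string are examples only, 
so check what is actually published before relying on them:

    // build.sbt -- illustrative only, not an official recommendation.
    // Pull snapshot artifacts from the ASF snapshot repository.
    resolvers += "Apache Snapshots" at "https://repository.apache.org/content/repositories/snapshots/"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0-SNAPSHOT" % "provided"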


> On 15 Aug 2016, at 09:08, Sean Owen  wrote:
> 
> I believe Chris was being a bit facetious.
> 
> The ASF guidance is right: it's important that people don't consume
> non-blessed snapshot builds as they would other releases. The intended
> audience is developers and so the easiest default policy is to only
> advertise the snapshots where only developers are likely to be
> looking.
> 
> That said, they're not secret or confidential, and while this probably
> should go to dev@, it's not a sin to mention the name of snapshots on
> user@, as long as these disclaimers are clear too. I'd rather a user
> understand the full picture, than find the snapshots and not
> understand any of the context.
> 
> On Mon, Aug 15, 2016 at 2:11 AM, Jacek Laskowski  wrote:
>> Hi Chris,
>> 
>> With my ASF member hat on...
>> 
>> Oh, come on, Chris. It's not "in violation of ASF policies"
>> whatsoever. Policies are for ASF developers, not for users. Honestly, I
>> was surprised to read the note in Mark Hamstra's email. It's very
>> restrictive, but it's about what committers and PMCs should do, not what
>> users should do:
>> 
>> "Do not include any links on the project website that might encourage
>> non-developers to download and use nightly builds, snapshots, release
>> candidates, or any other similar package."
>> 
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 



Re: Welcoming Felix Cheung as a committer

2016-08-15 Thread Paul Roy
Congrats Felix

Paul Roy.

On Mon, Aug 8, 2016 at 9:15 PM, Matei Zaharia 
wrote:

> Hi all,
>
> The PMC recently voted to add Felix Cheung as a committer. Felix has been
> a major contributor to SparkR and we're excited to have him join
> officially. Congrats and welcome, Felix!
>
> Matei
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
"Change is slow and gradual. It requires hard work, a bit of
luck, a fair amount of self-sacrifice and a lot of patience."

Roy.


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-15 Thread Sean Owen
I believe Chris was being a bit facetious.

The ASF guidance is right: it's important that people don't consume
non-blessed snapshot builds as they would other releases. The intended
audience is developers and so the easiest default policy is to only
advertise the snapshots where only developers are likely to be
looking.

That said, they're not secret or confidential, and while this probably
should go to dev@, it's not a sin to mention the name of snapshots on
user@, as long as these disclaimers are clear too. I'd rather a user
understand the full picture, than find the snapshots and not
understand any of the context.

On Mon, Aug 15, 2016 at 2:11 AM, Jacek Laskowski  wrote:
> Hi Chris,
>
> With my ASF member hat on...
>
> Oh, come on, Chris. It's not "in violation of ASF policies"
> whatsoever. Policies are for ASF developers, not for users. Honestly, I
> was surprised to read the note in Mark Hamstra's email. It's very
> restrictive, but it's about what committers and PMCs should do, not what
> users should do:
>
> "Do not include any links on the project website that might encourage
> non-developers to download and use nightly builds, snapshots, release
> candidates, or any other similar package."
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
