Issue upgrading to Spark 2.1.1 from 2.1.0

2017-05-07 Thread mhornbech
Hi

We have just tested the new Spark 2.1.1 release and observed an issue where
the driver program hangs when making predictions with a random forest model. The
issue disappears when downgrading to 2.1.0.

Has anyone observed similar issues? Recommendations on how to dig into this
would also be much appreciated. The driver program seemingly hangs (no
messages in the log and no running Spark jobs) with constant 100% CPU
usage.

Morten






Re: New runtime exception after switch to Spark 2.1.0

2017-01-18 Thread mhornbech
For anyone revisiting this at a later point: the issue was that Spark 2.1.0
upgrades Netty to version 4.0.42, which is not binary compatible with version
4.0.37 used by version 3.1.0 of the Cassandra Java Driver. The newer version
can work with Cassandra, but because of differences in the Maven artifacts
(Spark depends on netty-all while the Cassandra driver depends on netty-transport)
the conflict was not automatically resolved by SBT. Adding an explicit dependency
on netty-transport version 4.0.42 solved the problem.
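
For reference, a minimal build.sbt sketch of that workaround (the ".Final"
suffix is Netty's usual version format and may need adjusting to match what
Spark 2.1.0 actually pulls in):

// Added alongside the existing Spark and phantom dependencies.
libraryDependencies += "io.netty" % "netty-transport" % "4.0.42.Final"

// Alternatively, force the version without adding a direct dependency:
// dependencyOverrides += "io.netty" % "netty-transport" % "4.0.42.Final"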






Re: Error: PartitioningCollection requires all of its partitionings have the same numPartitions.

2017-01-04 Thread mhornbech
I am also experiencing this. Do you have a JIRA on it?






New runtime exception after switch to Spark 2.1.0

2016-12-31 Thread mhornbech
Hi

We just tested a switch from Spark 2.0.2 to Spark 2.1.0 on our codebase. It
compiles fine, but introduces the following runtime exception upon
initialization of our Cassandra database. I can't find any clues in the
release notes. Has anyone experienced this?

Morten

sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException
  at com.google.common.base.Throwables.propagate(Throwables.java:160)
  at com.datastax.driver.core.NettyUtil.newEventLoopGroupInstance(NettyUtil.java:136)
  at com.datastax.driver.core.NettyOptions.eventLoopGroup(NettyOptions.java:96)
  at com.datastax.driver.core.Connection$Factory.<init>(Connection.java:713)
  at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1375)
  at com.datastax.driver.core.Cluster.init(Cluster.java:163)
  at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:334)
  at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:309)
  at com.datastax.driver.core.Cluster.connect(Cluster.java:251)
  at com.websudos.phantom.connectors.DefaultSessionProvider$$anonfun$3$$anonfun$4.apply(DefaultSessionProvider.scala:66)
  at com.websudos.phantom.connectors.DefaultSessionProvider$$anonfun$3$$anonfun$4.apply(DefaultSessionProvider.scala:66)
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.package$.blocking(package.scala:123)
  at com.websudos.phantom.connectors.DefaultSessionProvider$$anonfun$3.apply(DefaultSessionProvider.scala:65)
  at com.websudos.phantom.connectors.DefaultSessionProvider$$anonfun$3.apply(DefaultSessionProvider.scala:64)
  at scala.util.Try$.apply(Try.scala:192)
  at com.websudos.phantom.connectors.DefaultSessionProvider.createSession(DefaultSessionProvider.scala:64)
  at com.websudos.phantom.connectors.DefaultSessionProvider.<init>(DefaultSessionProvider.scala:81)
  at com.websudos.phantom.connectors.KeySpaceDef.<init>(Keyspace.scala:92)






Random Forest hangs without trace of error

2016-12-09 Thread mhornbech
Hi

I have spent quite some time trying to debug an issue with the Random Forest
algorithm on Spark 2.0.2. The input dataset is relatively large at around
600k rows and 200MB, but I use subsampling to keep each tree manageable.
However, even with only 1 tree and a low sample rate of 0.05, the job hangs at
one of the final stages (see attached). I have checked the logs on all
executors and the driver and found no trace of an error. Could it be a memory
issue even though no error appears? The problem also seems somewhat sporadic,
so I wondered whether it could be a data issue that only occurs when the
subsample includes the bad rows.
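
For context, a minimal sketch of the setup (the thread does not say whether the
classifier or the regressor is used, and the column names below are assumptions):

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setNumTrees(1)             // even a single tree reproduces the hang
  .setSubsamplingRate(0.05)   // low sample rate so each tree stays manageable
  .setLabelCol("label")       // assumed column names
  .setFeaturesCol("features")

val model = rf.fit(trainingDF) // trainingDF: the ~600k row / 200MB dataset, prepared elsewhere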

Please comment if you have a clue.

Morten


 






Any estimate for a Spark 2.0.1 release date?

2016-09-05 Thread mhornbech
I can't find any unresolved JIRA issues with that tag. Apologies if
this is a rookie mistake and the information is available elsewhere.

Morten






Re: Spark 2.0 regression when querying very wide data frames

2016-08-22 Thread mhornbech
I don't think that's the issue. It sounds very much like this:
https://issues.apache.org/jira/browse/SPARK-16664

Morten

> On 20 Aug 2016, at 21:24, ponkin [via Apache Spark User List] wrote:
> 
> Did you try to load a wide CSV or Parquet file, for example? Maybe the
> problem is in spark-cassandra-connector rather than Spark itself? Are you using
> spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector)?





Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread mhornbech
Cassandra. 

Morten

> On 20 Aug 2016, at 13:53, ponkin [via Apache Spark User List] wrote:
> 
> Hi, 
> What kind of data source do you have? CSV, Avro, Parquet?





Re: Spark 2.0 regression when querying very wide data frames

2016-08-19 Thread mhornbech
I did some extra digging. Running the query "select column1 from myTable", I
can reproduce the problem on a frame with a single row. It occurs exactly
when the frame has more than 200 columns, which smells a bit like a
hardcoded limit.

Interestingly, the problem disappears when replacing the query with "select
column1 from myTable limit N", where N is arbitrary. However, it appears again
when running "select * from myTable limit N" with sufficiently many columns
(I haven't determined the exact threshold here).
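
A minimal sketch of that reproduction, assuming a SparkSession named spark
(the original data came from Cassandra, so an in-memory frame may or may not
trigger the same behaviour; column1 plus 200 extra columns puts the frame just
past the observed threshold):

import org.apache.spark.sql.functions.lit

// Single-row frame with 201 columns in total.
val wide = (1 to 200).foldLeft(spark.range(1).toDF("column1")) {
  case (df, i) => df.withColumn(s"col$i", lit(i))
}
wide.createOrReplaceTempView("myTable")

spark.sql("select column1 from myTable").show()          // came back empty/null here
spark.sql("select column1 from myTable limit 10").show() // worked as expected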






Spark 2.0 regression when querying very wide data frames

2016-08-19 Thread mhornbech
Hi

We currently have some workloads in Spark 1.6.2 with queries operating on a
data frame with 1500+ columns (17000 rows). This has never been quite
stable: some queries, such as "select *", would yield empty result sets,
but queries restricted to specific columns have mostly worked. Needless to
say, 1500+ columns isn't "desirable", but that is what the client's data
looks like, and our preference has been to load it and normalize it through
Spark.

We have been waiting to see how this would work with Spark 2.0, and
unfortunately the problem has gotten worse. Almost all queries on this large
data frame that worked before now return data frames containing only null
values.

Is this a known issue with Spark? If yes, does anyone know why it has been
left untouched / made worse in Spark 2.0? If data frames with many columns
are a limitation that goes deep into Spark, I would prefer hard errors rather
than queries that run but return meaningless results. The problem is easy to
reproduce, but I am not familiar enough with the Spark source code to debug
it and find the root cause.

Hope some of you can enlighten me :-)







Using Spark 2.0 inside Docker

2016-08-04 Thread mhornbech
Hi

We are currently running a setup with Spark 1.6.2 inside Docker. It requires
using HttpBroadcastFactory instead of the default TorrentBroadcastFactory
to avoid the random ports, which cannot be exposed through Docker. From the
Spark 2.0 release notes I can see that the HTTP broadcast option has been
removed. Are there any alternative means of running a Spark 2.0 cluster in
Docker?
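
For reference, the 1.6.2 setting in question, plus one possible (unverified)
direction for 2.0 based on pinning the ports instead of avoiding them; the
port numbers are placeholders:

import org.apache.spark.SparkConf

// Spark 1.6.2: force the HTTP broadcast implementation (removed in 2.0).
val conf16 = new SparkConf()
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")

// Spark 2.0 sketch: pin the otherwise random listener ports so that fixed
// ports can be published through Docker.
val conf20 = new SparkConf()
  .set("spark.driver.port", "40000")
  .set("spark.blockManager.port", "40001")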

Morten




