Issue upgrading to Spark 2.1.1 from 2.1.0

2017-05-07 Thread mhornbech
Hi. We have just tested the new Spark 2.1.1 release and observe an issue where the driver program hangs when making predictions using a random forest. The issue disappears when downgrading to 2.1.0. Has anyone observed similar issues? Recommendations on how to dig into this would also be much appreciated.

Re: New runtime exception after switch to Spark 2.1.0

2017-01-18 Thread mhornbech
For anyone revisiting this at a later point: the issue was that Spark 2.1.0 upgrades netty to version 4.0.42, which is not binary compatible with version 4.0.37, the version used by version 3.1.0 of the Cassandra Java Driver. The newer version can work with Cassandra, but because of differences in the maven
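One way to handle such a conflict in an sbt build is to force a single netty version across all dependencies. A minimal sketch, assuming an sbt 0.13 build; which netty artifacts actually need pinning should be checked against your own dependency tree:

    // build.sbt -- force the netty version Spark 2.1.0 ships with
    // onto every module that pulls in netty transitively
    dependencyOverrides ++= Set(
      "io.netty" % "netty-all"     % "4.0.42.Final",
      "io.netty" % "netty-handler" % "4.0.42.Final"
    )

Whether the Cassandra Java Driver 3.1.0 tolerates 4.0.42 at runtime has to be verified by testing, as noted above.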

Re: Error: PartitioningCollection requires all of its partitionings have the same numPartitions.

2017-01-04 Thread mhornbech
I am also experiencing this. Do you have a JIRA on it?

New runtime exception after switch to Spark 2.1.0

2016-12-31 Thread mhornbech
Hi. We just tested a switch from Spark 2.0.2 to Spark 2.1.0 on our codebase. It compiles fine but introduces the following runtime exception upon initialization of our Cassandra database. I can't find any clues in the release notes. Has anyone experienced this? Morten

sbt.ForkMain$ForkError:

Random Forest hangs without trace of error

2016-12-09 Thread mhornbech
Hi. I have spent quite some time trying to debug an issue with the Random Forest algorithm on Spark 2.0.2. The input dataset is relatively large at around 600k rows and 200MB, but I use subsampling to make each tree manageable. However, even with only 1 tree and a low sample rate of 0.05 the job
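For reference, a minimal sketch of those settings in the Spark ML API (Scala). That the job used a classifier rather than a regressor is an assumption, and trainingDf is a hypothetical DataFrame with "label" and "features" columns:

    import org.apache.spark.ml.classification.RandomForestClassifier

    // One tree with a 5% subsampling rate, as described above
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(1)
      .setSubsamplingRate(0.05)

    val model = rf.fit(trainingDf)  // the training step the thread reports as hanging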

Any estimate for a Spark 2.0.1 release date?

2016-09-05 Thread mhornbech
I can't find any JIRA issues with the tag that are unresolved. Apologies if this is a rookie mistake and the information is available elsewhere. Morten

Re: Spark 2.0 regression when querying very wide data frames

2016-08-22 Thread mhornbech
I don't think that's the issue. It sounds very much like this: https://issues.apache.org/jira/browse/SPARK-16664 Morten

> On 20 Aug 2016 at 21:24, ponkin [via Apache Spark User List] wrote:
> Did you try to load wide, for example, CSV file or

Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread mhornbech
Cassandra. Morten

> On 20 Aug 2016 at 13:53, ponkin [via Apache Spark User List] wrote:
> Hi,
> What kind of datasource do you have? CSV, Avro, Parquet?

Re: Spark 2.0 regression when querying very wide data frames

2016-08-19 Thread mhornbech
I did some extra digging. Running the query "select column1 from myTable", I can reproduce the problem on a frame with a single row: it occurs exactly when the frame has more than 200 columns, which smells a bit like a hardcoded limit. Interestingly, the problem disappears when replacing the query
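A sketch of that reproduction under the conditions stated above (single row, just over 200 columns); spark is an existing SparkSession and the names mirror the query:

    import org.apache.spark.sql.functions.lit

    // Single-row DataFrame with 201 generated columns (plus the built-in "id")
    val wide = (1 to 201).foldLeft(spark.range(1).toDF()) {
      (df, i) => df.withColumn(s"column$i", lit(i))
    }
    wide.createOrReplaceTempView("myTable")

    spark.sql("select column1 from myTable").show()  // misbehaves on 2.0 per the report

The exact 200-column threshold is consistent with SPARK-16664, referenced elsewhere in this thread.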

Spark 2.0 regression when querying very wide data frames

2016-08-19 Thread mhornbech
Hi. We currently have some workloads in Spark 1.6.2 with queries operating on a data frame with 1500+ columns (17000 rows). This has never been quite stable, and some queries, such as "select *", would yield empty result sets, but queries restricted to specific columns have mostly worked. Needless

Using Spark 2.0 inside Docker

2016-08-04 Thread mhornbech
Hi. We are currently running a setup with Spark 1.6.2 inside Docker. It requires the use of the HTTPBroadcastFactory instead of the default TorrentBroadcastFactory to avoid the use of random ports, which cannot be exposed through Docker. From the Spark 2.0 release notes I can see that the
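If HTTP broadcast is no longer available in 2.0 (as the release notes cited above appear to indicate), one workaround is to keep the default TorrentBroadcastFactory but pin the otherwise random ports to fixed values that Docker can publish. A minimal sketch; the port numbers are arbitrary examples:

    import org.apache.spark.SparkConf

    // Fix the ports Spark would otherwise choose at random,
    // so they can be mapped explicitly in the Docker container
    val conf = new SparkConf()
      .set("spark.driver.port",       "7001")
      .set("spark.blockManager.port", "7002")
      .set("spark.ui.port",           "4040")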