Fwd: Saving large textfile

2016-04-24 Thread Simon Hafner
2016-04-24 13:38 GMT+02:00 Stefan Falk :
> sc.parallelize(cfile.toString()
>   .split("\n"), 1)
Try `sc.textFile(pathToFile)` instead.

>java.io.IOException: Broken pipe
>at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>...

That sounds like something is crashing, possibly an OOM. Don't use Spark to
aggregate into a single text file; do that with something else (e.g. `cat`)
afterwards.
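
A minimal sketch of what I mean (the paths and the transformation are made-up
examples): read the file through Spark instead of parallelizing one huge
in-memory string, keep the natural partitioning when writing, and merge the
part files outside Spark.

val lines = sc.textFile("hdfs:///data/cfile.txt")
val processed = lines.map(_.toUpperCase)   // placeholder transformation
processed.saveAsTextFile("hdfs:///data/out")

// afterwards, outside Spark, e.g.:
//   hdfs dfs -getmerge /data/out merged.txt   (or plain `cat` for local part files)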




Re: StreamCorruptedException during deserialization

2016-03-29 Thread Simon Hafner
2016-03-29 11:25 GMT+02:00 Robert Schmidtke :
> Is there a meaningful way for me to find out what exactly is going wrong
> here? Any help and hints are greatly appreciated!
Maybe a version mismatch between the jars on the cluster?




Re: Output is being stored on the clusters (slaves).

2016-03-24 Thread Simon Hafner
2016-03-24 11:09 GMT+01:00 Shishir Anshuman :
> I am using two Slaves to run the ALS algorithm. I am saving the predictions
> in a textfile using :
>   saveAsTextFile(path)
>
> The predictions is getting stored on the slaves but I want the predictions
> to be saved on the Master.
Yes, that is expected behavior. `path` is resolved on the machine where the
task runs, i.e. the slaves. I'd recommend either using a cluster filesystem
(e.g. HDFS) or calling `.collect()` on your data so you can save it locally
on the master. Beware of OOM if your data is big.
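
Rough sketches of both options (paths are made-up examples):

// Option 1: write to a filesystem every node can reach, e.g. HDFS.
predictions.saveAsTextFile("hdfs:///user/shishir/predictions")

// Option 2: pull everything to the driver on the master and write it there.
// Only safe if the predictions fit comfortably in driver memory.
import java.io.PrintWriter
val local = predictions.collect()
val out = new PrintWriter("/home/shishir/predictions.txt")
try local.foreach(p => out.println(p)) finally out.close()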




Re: No active SparkContext

2016-03-24 Thread Simon Hafner
2016-03-24 9:54 GMT+01:00 Max Schmidt :
> we're using with the java-api (1.6.0) a ScheduledExecutor that continuously
> executes a SparkJob to a standalone cluster.
I'd recommend Scala.

> After each job we close the JavaSparkContext and create a new one.
Why do that? You can happily reuse it. That is most likely also what causes
the other problems: you have a race condition between waiting for the job to
finish and stopping the context.
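
A rough sketch of what I mean, in Scala (the job body and the interval are
placeholders):

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.{SparkConf, SparkContext}

object ScheduledJobs {
  def main(args: Array[String]): Unit = {
    // create the context once ...
    val sc = new SparkContext(new SparkConf().setAppName("scheduled-jobs"))
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    // ... and let every scheduled run reuse it
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit = println(sc.parallelize(1 to 100).sum())
    }, 0, 10, TimeUnit.MINUTES)
    // on shutdown, stop the scheduler first, then the context
    sys.addShutdownHook { scheduler.shutdown(); sc.stop() }
  }
}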


Re: Installing Spark on Mac

2016-03-04 Thread Simon Hafner
I'd try `brew install spark` or `brew install apache-spark` and see where that
gets you. https://github.com/Homebrew/homebrew

2016-03-04 21:18 GMT+01:00 Aida :
> Hi all,
>
> I am a complete novice and was wondering whether anyone would be willing to
> provide me with a step by step guide on how to install Spark on a Mac; on
> standalone mode btw.
>
> I downloaded a prebuilt version, the second version from the top. However, I
> have not installed Hadoop and am not planning to at this stage.
>
> I also downloaded Scala from the Scala website, do I need to download
> anything else?
>
> I am very eager to learn more about Spark but am unsure about the best way
> to do it.
>
> I would be happy for any suggestions or ideas
>
> Many thanks,
>
> Aida
>
>
>




Re: Running synchronized JRI code

2016-02-15 Thread Simon Hafner
2016-02-15 14:02 GMT+01:00 Sun, Rui :
> On computation, RRDD launches one R process for each partition, so there 
> won't be thread-safe issue
>
> Could you give more details on your new environment?

Running on EC2, I start the executors via

 /usr/bin/R CMD javareconf -e "/usr/lib/spark/sbin/start-master.sh"

I invoke R roughly like this:

object R {
  case class Element(value: Double)

  lazy val re = Option(REngine.getLastEngine()).getOrElse({
    val eng = new JRI.JRIEngine()
    eng.parseAndEval(
      scala.io.Source
        .fromInputStream(this.getClass().getClassLoader().getResourceAsStream("r/fit.R"))
        .mkString)
    eng
  })

  def fit(curve: Seq[Element]): Option[Fitting] = {
    synchronized {
      val env = re.newEnvironment(null, false)
      re.assign("curve", new REXPDouble(curve.map(_.value).toArray), env)
      val df = re.parseAndEval("data.frame(curve=curve)", env, true)
      re.assign("df", df, env)
      val fitted = re.parseAndEval("fit(df)", env, true).asList
      if (fitted.keys == null) {
        None
      } else {
        val map = fitted.keys.map(key => (key, fitted.at(key).asDouble)).toMap
        Some(Fitting(map("values")))
      }
    }
  }
}

where `fit` is wrapped in a UDAF.




Re: Running synchronized JRI code

2016-02-15 Thread Simon Hafner
2016-02-15 4:35 GMT+01:00 Sun, Rui :
> Yes, JRI loads an R dynamic library into the executor JVM, which faces 
> thread-safe issue when there are multiple task threads within the executor.
>
> I am thinking if the demand like yours (calling R code in RDD 
> transformations) is much desired, we may consider refactoring RRDD for this 
> purpose, although it is currently intended for internal use by SparkR and not 
> a public API.
So the RRDDs don't have that thread safety issue? I'm currently
creating a new environment for each call, but it still crashes.




Running synchronized JRI code

2016-02-14 Thread Simon Hafner
Hello

I'm currently running R code in an executor via JRI. Because R is
single-threaded, any call to R needs to be wrapped in a `synchronized` block.
As a result I can't use much more than one core per executor, which is
undesirable. Is there a way to tell Spark that this specific application (or
even a specific UDF) needs multiple JVMs? Or should I switch from JRI to a
(slower) pipe-based setup?

Cheers,
Simon




Re: Serializing DataSets

2016-01-19 Thread Simon Hafner
The occasional type error if the casting goes wrong for whatever reason.
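
Roughly what I'm doing (Spark 1.6-era API; the case class and path are made-up
examples):

import sqlContext.implicits._

case class Person(name: String, age: Int)
val ds = sqlContext.createDataset(Seq(Person("Ann", 30), Person("Bob", 25)))

// write: convert to a DataFrame (purely logical, per below) and use its writers
ds.toDF().write.parquet("/tmp/people.parquet")

// read back and cast to the case class; a schema mismatch shows up
// when the query is analyzed or the rows are decoded
val back = sqlContext.read.parquet("/tmp/people.parquet").as[Person]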

2016-01-19 1:22 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
> What error?
>
> On Mon, Jan 18, 2016 at 9:01 AM, Simon Hafner <reactorm...@gmail.com> wrote:
>>
>> And for deserializing,
>> `sqlContext.read.parquet("path/to/parquet").as[T]` and catch the
>> error?
>>
>> 2016-01-14 3:43 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
>> > Yeah, thats the best way for now (note the conversion is purely logical
>> > so
>> > there is no cost of calling toDF()).  We'll likely be combining the
>> > classes
>> > in Spark 2.0 to remove this awkwardness.
>> >
>> > On Tue, Jan 12, 2016 at 11:20 PM, Simon Hafner <reactorm...@gmail.com>
>> > wrote:
>> >>
>> >> What's the proper way to write DataSets to disk? Convert them to a
>> >> DataFrame and use the writers there?
>> >>
>> >
>
>




Re: Serializing DataSets

2016-01-18 Thread Simon Hafner
And for deserializing,
`sqlContext.read.parquet("path/to/parquet").as[T]` and catch the
error?

2016-01-14 3:43 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
> Yeah, thats the best way for now (note the conversion is purely logical so
> there is no cost of calling toDF()).  We'll likely be combining the classes
> in Spark 2.0 to remove this awkwardness.
>
> On Tue, Jan 12, 2016 at 11:20 PM, Simon Hafner <reactorm...@gmail.com>
> wrote:
>>
>> What's the proper way to write DataSets to disk? Convert them to a
>> DataFrame and use the writers there?
>>
>




Serializing DataSets

2016-01-12 Thread Simon Hafner
What's the proper way to write DataSets to disk? Convert them to a
DataFrame and use the writers there?




Re: Compiling spark 1.5.1 fails with scala.reflect.internal.Types$TypeError: bad symbolic reference.

2015-12-16 Thread Simon Hafner
It happens with 2.11; you'll have to do both:

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

You get that error if you forget one of them, IIRC.

2015-12-05 20:17 GMT+08:00 MrAsanjar . <afsan...@gmail.com>:
> Simon, I am getting the same error, how did you resolved the problem.
>
> On Fri, Oct 16, 2015 at 9:54 AM, Simon Hafner <reactorm...@gmail.com> wrote:
>>
>> Fresh clone of spark 1.5.1, java version "1.7.0_85"
>>
>> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>> package
>>
>> [error] bad symbolic reference. A signature in WebUI.class refers to
>> term eclipse
>> [error] in package org which is not available.
>> [error] It may be completely missing from the current classpath, or
>> the version on
>> [error] the classpath might be incompatible with the version used when
>> compiling WebUI.class.
>> [error] bad symbolic reference. A signature in WebUI.class refers to term
>> jetty
>> [error] in value org.eclipse which is not available.
>> [error] It may be completely missing from the current classpath, or
>> the version on
>> [error] the classpath might be incompatible with the version used when
>> compiling WebUI.class.
>> [error]
>> [error]  while compiling:
>> /root/spark/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
>> [error] during phase: erasure
>> [error]  library version: version 2.10.4
>> [error] compiler version: version 2.10.4
>> [error]   reconstructed args: -deprecation -classpath
>>
>> /root/spark/sql/core/target/scala-2.10/classes:/root/.m2/repository/org/apache/spark/spark-core_2.10/1.6.0-SNAPSHOT/spark-core_2.10-1.6.0-SNAPSHOT.jar:/root/.m
>>
>> 2/repository/org/apache/avro/avro-mapred/1.7.7/avro-mapred-1.7.7-hadoop2.jar:/root/.m2/repository/org/apache/avro/avro-ipc/1.7.7/avro-ipc-1.7.7.jar:/root/.m2/repository/org/apache/avro/avro-ipc/1.7.7/avro-ipc-1.7.
>>
>> 7-tests.jar:/root/.m2/repository/com/twitter/chill_2.10/0.5.0/chill_2.10-0.5.0.jar:/root/.m2/repository/com/esotericsoftware/kryo/kryo/2.21/kryo-2.21.jar:/root/.m2/repository/com/esotericsoftware/reflectasm/reflec
>>
>> tasm/1.07/reflectasm-1.07-shaded.jar:/root/.m2/repository/com/esotericsoftware/minlog/minlog/1.2/minlog-1.2.jar:/root/.m2/repository/com/twitter/chill-java/0.5.0/chill-java-0.5.0.jar:/root/.m2/repository/org/apach
>>
>> e/hadoop/hadoop-client/2.4.0/hadoop-client-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-common/2.4.0/hadoop-common-2.4.0.jar:/root/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/root/.m
>>
>> 2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/root/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/root/.m2/repository/commons-collections/commons-collections/3.2.1/commons-
>>
>> collections-3.2.1.jar:/root/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/root/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/root/.m
>>
>> 2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/root/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/root/.m2/repository/org/apac
>>
>> he/hadoop/hadoop-auth/2.4.0/hadoop-auth-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.4.0/hadoop-hdfs-2.4.0.jar:/root/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/root
>>
>> /.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.4.0/hadoop-mapreduce-client-app-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.4.0/hadoop-mapreduce-client-common-
>>
>> 2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.4.0/hadoop-yarn-client-2.4.0.jar:/root/.m2/repository/com/sun/jersey/jersey-client/1.9/jersey-client-1.9.jar:/root/.m2/repository/org/apache/ha
>>
>> doop/hadoop-yarn-server-common/2.4.0/hadoop-yarn-server-common-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.4.0/hadoop-mapreduce-client-shuffle-2.4.0.jar:/root/.m2/repository/
>>
>> org/apache/hadoop/hadoop-yarn-api/2.4.0/hadoop-yarn-api-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.4.0/hadoop-mapreduce-client-core-2.4.0.jar:/root/.m2/repository/org/apache/ha
>>
>> doop/hadoop-yarn-common/2.4.0/hadoop-yarn-common-2.4.0.jar:/root/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/root/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/root/.m2/re
>>
>> pository/org/apache/hadoop/hadoop-mapreduce-client-jobclient

Fwd: Where does mllib's .save method save a model to?

2015-11-03 Thread Simon Hafner
2015-11-03 20:26 GMT+01:00 xenocyon :
> I want to save an mllib model to disk, and am trying the model.save
> operation as described in
> http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples:
>
> model.save(sc, "myModelPath")
>
> But after running it, I am unable to find any newly created file or
> dir by the name "myModelPath" in any obvious places. Any ideas where
> it might lie?
It goes to the HDFS configured for your Spark instance. If you want to store
it in your local file system, use

file:///path/to/model

instead.
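
For the ALS example from the linked docs, that would look roughly like this
(the local path is just an example):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// force a local-filesystem URI instead of the default filesystem (HDFS)
model.save(sc, "file:///tmp/myModelPath")
val sameModel = MatrixFactorizationModel.load(sc, "file:///tmp/myModelPath")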




Fwd: collect() local faster than 4 node cluster

2015-11-03 Thread Simon Hafner
2015-11-03 20:07 GMT+01:00 Sebastian Kuepers :
> Hey,
>
> with collect() RDDs elements are send as a list back to the driver.
>
> If have a 4 node cluster (based on Mesos) in a datacenter and I have my
> local dev machine.
>
> I work with a small 200MB dataset just for testing during development right
> now.
>
> The collect() tasks are running for times faster on my local machine, than
> on the cluster, although it actually uses 4x the number of cores etc.
>
> It's 7 seconds locally and 28 seconds on the cluster for the same collect()
> job.
>
> What's the reason for that? Is that just network latency sending back the
> data to the driver within the cluster? (well it's just this 200MB in total)
>
> Is that somehow a kind of 'management overhead' form Mesos?
>
> Appreciate any thoughts an possible impacts for that!
Serialization and sending data over the network take time, far more than
simply processing the data on the same machine, but the local approach doesn't
scale as well. Try with more data and plot the results.
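
Something along these lines (sizes and partition count are arbitrary; adjust
so the largest run still fits in driver memory):

for (n <- Seq(100000, 1000000, 10000000)) {
  val rdd = sc.parallelize(1 to n, 64).map(_.toLong).cache()
  rdd.count()                              // materialize before timing
  val start = System.nanoTime()
  val local = rdd.collect()
  val secs = (System.nanoTime() - start) / 1e9
  println(s"n=$n collected ${local.length} elements in $secs s")
}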




Re: Support Ordering on UserDefinedType

2015-11-03 Thread Simon Hafner
2015-11-03 23:20 GMT+01:00 Ionized :
> TypeUtils.getInterpretedOrdering currently only supports AtomicType and
> StructType. Is it possible to add support for UserDefinedType as well?
Yes, make a PR to Spark.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L57-L62




Compiling spark 1.5.1 fails with scala.reflect.internal.Types$TypeError: bad symbolic reference.

2015-10-16 Thread Simon Hafner
Fresh clone of spark 1.5.1, java version "1.7.0_85"

build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

[error] bad symbolic reference. A signature in WebUI.class refers to
term eclipse
[error] in package org which is not available.
[error] It may be completely missing from the current classpath, or
the version on
[error] the classpath might be incompatible with the version used when
compiling WebUI.class.
[error] bad symbolic reference. A signature in WebUI.class refers to term jetty
[error] in value org.eclipse which is not available.
[error] It may be completely missing from the current classpath, or
the version on
[error] the classpath might be incompatible with the version used when
compiling WebUI.class.
[error]
[error]  while compiling:
/root/spark/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
[error] during phase: erasure
[error]  library version: version 2.10.4
[error] compiler version: version 2.10.4
[error]   reconstructed args: -deprecation -classpath
/root/spark/sql/core/target/scala-2.10/classes:/root/.m2/repository/org/apache/spark/spark-core_2.10/1.6.0-SNAPSHOT/spark-core_2.10-1.6.0-SNAPSHOT.jar:/root/.m
2/repository/org/apache/avro/avro-mapred/1.7.7/avro-mapred-1.7.7-hadoop2.jar:/root/.m2/repository/org/apache/avro/avro-ipc/1.7.7/avro-ipc-1.7.7.jar:/root/.m2/repository/org/apache/avro/avro-ipc/1.7.7/avro-ipc-1.7.
7-tests.jar:/root/.m2/repository/com/twitter/chill_2.10/0.5.0/chill_2.10-0.5.0.jar:/root/.m2/repository/com/esotericsoftware/kryo/kryo/2.21/kryo-2.21.jar:/root/.m2/repository/com/esotericsoftware/reflectasm/reflec
tasm/1.07/reflectasm-1.07-shaded.jar:/root/.m2/repository/com/esotericsoftware/minlog/minlog/1.2/minlog-1.2.jar:/root/.m2/repository/com/twitter/chill-java/0.5.0/chill-java-0.5.0.jar:/root/.m2/repository/org/apach
e/hadoop/hadoop-client/2.4.0/hadoop-client-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-common/2.4.0/hadoop-common-2.4.0.jar:/root/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/root/.m
2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/root/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/root/.m2/repository/commons-collections/commons-collections/3.2.1/commons-
collections-3.2.1.jar:/root/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/root/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/root/.m
2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/root/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/root/.m2/repository/org/apac
he/hadoop/hadoop-auth/2.4.0/hadoop-auth-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.4.0/hadoop-hdfs-2.4.0.jar:/root/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/root
/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.4.0/hadoop-mapreduce-client-app-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.4.0/hadoop-mapreduce-client-common-
2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.4.0/hadoop-yarn-client-2.4.0.jar:/root/.m2/repository/com/sun/jersey/jersey-client/1.9/jersey-client-1.9.jar:/root/.m2/repository/org/apache/ha
doop/hadoop-yarn-server-common/2.4.0/hadoop-yarn-server-common-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.4.0/hadoop-mapreduce-client-shuffle-2.4.0.jar:/root/.m2/repository/
org/apache/hadoop/hadoop-yarn-api/2.4.0/hadoop-yarn-api-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.4.0/hadoop-mapreduce-client-core-2.4.0.jar:/root/.m2/repository/org/apache/ha
doop/hadoop-yarn-common/2.4.0/hadoop-yarn-common-2.4.0.jar:/root/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/root/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/root/.m2/re
pository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.4.0/hadoop-mapreduce-client-jobclient-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-annotations/2.4.0/hadoop-annotations-2.4.0.jar:/root/.m2
/repository/org/apache/spark/spark-launcher_2.10/1.6.0-SNAPSHOT/spark-launcher_2.10-1.6.0-SNAPSHOT.jar:/root/.m2/repository/org/apache/spark/spark-network-common_2.10/1.6.0-SNAPSHOT/spark-network-common_2.10-1.6.0
-SNAPSHOT.jar:/root/.m2/repository/org/apache/spark/spark-network-shuffle_2.10/1.6.0-SNAPSHOT/spark-network-shuffle_2.10-1.6.0-SNAPSHOT.jar:/root/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldb
jni-all-1.8.jar:/root/.m2/repository/org/apache/spark/spark-unsafe_2.10/1.6.0-SNAPSHOT/spark-unsafe_2.10-1.6.0-SNAPSHOT.jar:/root/.m2/repository/net/java/dev/jets3t/jets3t/0.9.3/jets3t-0.9.3.jar:/root/.m2/reposito

udaf with multiple return values in spark 1.5.0

2015-09-06 Thread Simon Hafner
Hi everyone

Is it possible to return multiple values from a UDAF defined in Spark
1.5.0? The documentation [1] mentions

abstract def dataType: DataType
The DataType of the returned value of this UserDefinedAggregateFunction.

so it's only possible to return a single value. Should I use ArrayType
as a workaround here? The returned values are all doubles.
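
For reference, a minimal sketch of the ArrayType workaround I have in mind
(the min/max aggregate itself is just an illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// returns two doubles (min and max) packed into one ArrayType(DoubleType) value
class MinMax extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("min", DoubleType) :: StructField("max", DoubleType) :: Nil)
  def dataType: DataType = ArrayType(DoubleType)
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Double.MaxValue
    buffer(1) = Double.MinValue
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val v = input.getDouble(0)
    buffer(0) = math.min(buffer.getDouble(0), v)
    buffer(1) = math.max(buffer.getDouble(1), v)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = math.min(buffer1.getDouble(0), buffer2.getDouble(0))
    buffer1(1) = math.max(buffer1.getDouble(1), buffer2.getDouble(1))
  }
  def evaluate(buffer: Row): Any = Seq(buffer.getDouble(0), buffer.getDouble(1))
}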

Cheers

[1] 
https://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction




wholeTextFiles on 20 nodes

2014-11-23 Thread Simon Hafner
I have 20 nodes via EC2 and an application that reads the data via
wholeTextFiles. I've tried to copy the data into hadoop via
copyFromLocal, and I get

14/11/24 02:00:07 INFO hdfs.DFSClient: Exception in
createBlockOutputStream 172.31.2.209:50010 java.io.IOException: Bad
connect ack with firstBadLink as X:50010
14/11/24 02:00:07 INFO hdfs.DFSClient: Abandoning block
blk_-8725559184260876712_2627
14/11/24 02:00:07 INFO hdfs.DFSClient: Excluding datanode X:50010

a lot. Then I went the file system route via copy-dir, which worked
well. Now everything is under /root/txt on all nodes. I submitted the
job with the file:///root/txt/ directory for wholeTextFiles() and I
get

Exception in thread "main" java.io.FileNotFoundException: File does
not exist: /root/txt/3521.txt

The file exists on the root node and should be everywhere according to
copy-dir. The Hadoop variant worked fine with 3 nodes, but it starts
misbehaving with 20. I added

  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>4096</value>
  </property>

to hdfs-site.xml and core-site.xml, but that didn't help.




log4j logging control via sbt

2014-11-05 Thread Simon Hafner
I've tried to set the log4j logger to WARN via a log4j properties file:

cat src/test/resources/log4j.properties
log4j.logger.org.apache.spark=WARN

or in sbt via

javaOptions += "-Dlog4j.logger.org.apache.spark=WARN"

But the logger still gives me INFO messages to stdout when I run my tests via

sbt test

Is it the wrong option? I also tried

javaOptions += "-Dlog4j.rootLogger=warn"

but that doesn't seem to help either.
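
One more thing I should check (an assumption on my part, not something I've
verified): sbt only passes javaOptions to forked JVMs, so an in-process
`sbt test` run would ignore those -D flags entirely. A sketch of a forked
setup in build.sbt (the log4j.configuration path is just an example):

fork in Test := true
javaOptions in Test += "-Dlog4j.configuration=file:src/test/resources/log4j.properties"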




Spark with HLists

2014-10-29 Thread Simon Hafner
I tried using shapeless HLists to store data inside Spark. Unsurprisingly, it
failed. The deserialization isn't well-defined because of all the implicits
used by shapeless. How could I make it work?

Sample Code:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import shapeless._
import ops.hlist._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/tmp/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData
      .map(line => line :: HNil)
      .filter(_.select[String].contains("a"))
      .count()
    println("Lines with a: %s".format(numAs))
  }
}

Error:

Exception in thread "main" java.lang.NoClassDefFoundError: shapeless/$colon$colon
at SimpleApp$.main(SimpleApp.scala:15)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
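
A hedged guess, independent of the serialization question: the
NoClassDefFoundError above suggests the shapeless jar simply isn't on the
classpath of the submitted application at all. One way around that would be a
fat jar via sbt-assembly; versions below are illustrative assumptions, not
what I've actually tested.

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "com.chuusai" %% "shapeless" % "2.0.0"
)

// then: sbt assembly, and spark-submit the resulting *-assembly.jar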
