Re: Spark speed performance

2014-10-19 Thread jan.zikes
Thank you very much. Lots of very small JSON files were exactly the cause of the speed 
problem. Using coalesce, my Spark program now runs on a single node only twice as 
slowly (even including Spark startup) as the single-node Python program, which is 
acceptable.

Jan 
__

Because of the overhead between the JVM and Python, a single task will be
slower than your local Python script, but it's very easy to scale to
many CPUs.

Even on one CPU, it's not common for PySpark to be 100 times slower. You
have many small files; each file will be processed by a task, which
carries about 100 ms of overhead (scheduling and execution), while a small
file can be processed by your single-threaded Python script in less than
1 ms.

You could pack your JSON files into larger ones, or you could try to
merge the small tasks into larger ones with coalesce(N), for example:

distData = sc.textFile(sys.argv[2]).coalesce(10)  # will end up with 10 partitions (tasks)

Davies

On Sat, Oct 18, 2014 at 12:07 PM,   wrote:
> Hi,
>
> I have a program written for a single computer (in Python) and have also
> implemented the same thing in Spark. The program basically only reads .json files,
> takes one field from them and saves it back. Using Spark, my program runs
> approximately 100 times slower on 1 master and 1 slave. So I would like to
> ask where the problem might be.
>
> My Spark program looks like:
>
>
>
> sc = SparkContext(appName="Json data preprocessor")
>
> distData = sc.textFile(sys.argv[2])
>
> json_extractor = JsonExtractor(sys.argv[1])
>
> cleanedData = distData.flatMap(json_extractor.extract_json)
>
> cleanedData.saveAsTextFile(sys.argv[3])
>
> JsonExtractor only selects the data from the field given by sys.argv[1].
>
>
>
> My data are basically many small JSON files, with one JSON object per line.
>
> I have tried reading and writing the data both from/to Amazon S3 and from/to the
> local disk on all the machines.
>
> I would like to ask whether there is something I am missing, or whether Spark is
> supposed to be this slow in comparison with the local, non-parallelized,
> single-node program.
>
>
>
> Thank you in advance for any suggestions or hints.
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Why does the driver's connection to the master fail?

2014-10-19 Thread randylu
  In my program, the application's connection to the master always fails for
several iterations. The driver's log is as follows:
WARN AppClient$ClientActor: Connection to
akka.tcp://sparkMaster@master1:7077 failed; waiting for master to
reconnect...
  Why does this warning happen and how can I avoid it?
  Best,
  randylu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-does-driver-connects-to-master-fail-tp16758.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Why does the driver's connection to the master fail?

2014-10-19 Thread randylu
  In addition, the driver receives several DisassociatedEvent messages.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-does-driver-connects-to-master-fail-tp16758p16759.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: What's wrong with my spark filter? I get "org.apache.spark.SparkException: Task not serializable"

2014-10-19 Thread Ilya Ganelin
Check for any variables you've declared in your class. Even if you're not
referencing them from the function, they are passed to the worker nodes as part
of the closure. Consequently, if you have something without a default
serializer (like an imported class) it will also get passed.

To fix this you can either move that variable out of the class (make it
global) or you can implement Kryo serialization (see the Spark tuning guide
for this).
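For example, a minimal sketch of the first suggestion; all names here are
invented for illustration, not taken from the original code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Moving the function into a top-level ("global") object means the filter
// closure no longer drags the enclosing class and its fields along.
object Helpers {
  def nonEmpty(row: String): Boolean = row.nonEmpty
}

class Analysis(sc: SparkContext) {
  // Referencing Helpers.nonEmpty captures only the Helpers object, not `this`.
  def keepNonEmpty(rdd: RDD[String]): RDD[String] = rdd.filter(Helpers.nonEmpty _)
}

// The Kryo route is instead configured on the SparkConf, e.g.:
//   new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")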
On Oct 17, 2014 6:37 AM, "shahab"  wrote:

> Hi,
>
> Probably I am missing a very simple principle, but something is wrong with
> my filter; I get an "org.apache.spark.SparkException: Task not serializable"
> exception.
>
> here is my filter function:
> object OBJ {
>def f1(): Boolean = {
>  var i = 1;
>  for (j<-1 to 10) i = i +1;
>  true;
>}
> }
>
> rdd.filter(row => OBJ.f1())
>
>
> And when I run, I get the following exception:
>
> org.apache.spark.SparkException: Task not serializable
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> at org.apache.spark.rdd.RDD.filter(RDD.scala:282)
> ...
> Caused by: java.io.NotSerializableException: org.apache.spark.SparkConf
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> ...
>
>
>
> best,
> /Shahab
>
>


scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
I am working with Spark 1.1.0 and I believe Timestamp is a supported data type
for Spark SQL. However, I keep getting this MatchError for java.sql.Timestamp
when I try to use reflection to register a Java Bean with a Timestamp field.
Is anything wrong with my code below?

public static class Event implements Serializable {
    private String name;
    private Timestamp time;
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public Timestamp getTime() {
        return time;
    }
    public void setTime(Timestamp time) {
        this.time = time;
    }
}

@Test
public void testTimeStamp() {
    JavaSparkContext sc = new JavaSparkContext("local", "timestamp");
    String[] data = {"1,2014-01-01", "2,2014-02-01"};
    JavaRDD<String> input = sc.parallelize(Arrays.asList(data));
    JavaRDD<Event> events = input.map(new Function<String, Event>() {
        public Event call(String arg0) throws Exception {
            String[] c = arg0.split(",");
            Event e = new Event();
            e.setName(c[0]);
            DateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            e.setTime(new Timestamp(fmt.parse(c[1]).getTime()));
            return e;
        }
    });

    JavaSQLContext sqlCtx = new JavaSQLContext(sc);
    JavaSchemaRDD schemaEvent = sqlCtx.applySchema(events, Event.class);
    schemaEvent.registerTempTable("event");

    sc.stop();
}


RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with Timestamp field. 
Anything wrong with my code below?

public static class Event implements Serializable {
private String name;
private Timestamp time;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Timestamp getTime() {
return time;
}
public void setTime(Timestamp time) {
this.time = time;
}
}

@Test
public void testTimeStamp() {
JavaSparkContext sc = new 
JavaSparkContext("local", "timestamp");
String[] data = {"1,2014-01-01", 
"2,2014-02-01"};
JavaRDD input = 
sc.parallelize(Arrays.asList(data));
JavaRDD events = input.map(new 
Function() {
public Event call(String arg0) 
throws Exception {
String[] c = 
arg0.split(",");
Event e = new 
Event();
e.setName(c[0]);
DateFormat fmt 
= new SimpleDateFormat("-MM-dd");
e.setTime(new 
Timestamp(fmt.parse(c[1]).getTime()));
return e;
}
});

JavaSQLContext sqlCtx = new JavaSQLContext(sc);
JavaSchemaRDD schemaEvent = 
sqlCtx.applySchema(events, Event.class);
schemaEvent.registerTempTable("event");

sc.stop();
}


RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at 
com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown 
Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Sunday, October 19, 2014 10:31 AM
To: Ge, Yao (Y.); user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with Timestamp field. 
Anything wrong with my code below?

public static class Event implements Serializable {
private String name;
private Timestamp time;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Timestamp getTime() {
return time;
}
public void setTime(Timestamp time) {
this.time = time;
}
}

@Test
public void testTimeStamp() {
JavaSparkContext sc = new 
JavaSparkConte

Error while running Streaming examples - no snappyjava in java.library.path

2014-10-19 Thread bdev
I built the latest Spark project and I'm running into these errors when
attempting to run the streaming examples locally on my Mac. How do I fix
these errors?

java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at 
org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:170)
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:145)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
at 
org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:81)
at
org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
at
org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1083)
at
org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579)
at
org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579)
at
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:126)
at
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:731)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at
org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:789)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:731)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:727)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:727)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-Streaming-examples-no-snappyjava-in-java-library-path-tp16765.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Using SVMWithSGD model to predict

2014-10-19 Thread npomfret
Hi, I'm new to Spark and just trying to make sense of the SVMWithSGD example.
I ran my dataset through it and built a model.  When I call predict() on the
testing data (after clearThreshold()) I was expecting to get answers in the
range of 0 to 1.  But they aren't; all predictions seem to be negative
numbers between -0 and -2.  I guess my question is what do these predictions
mean?  How are they of use?  The outcome I need is a probability rather than
a binary.

Here's my Java code:

SparkConf conf = new SparkConf()
        .setAppName("name")
        .set("spark.cores.max", "1");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<LabeledPoint> points = sc.textFile(path).map(new ParsePoint()).cache();
JavaRDD<LabeledPoint> training = points.sample(false, 0.8, 0L).cache();
JavaRDD<LabeledPoint> testing = points.subtract(training);
SVMModel model = SVMWithSGD.train(training.rdd(), 100);
model.clearThreshold();
for (LabeledPoint point : testing.toArray()) {
    Double score = model.predict(point.features());
    System.out.println("score = " + score); // <- all these are negative numbers, seemingly between 0 and -2
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-SVMWithSGD-model-to-predict-tp16767.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Using SVMWithSGD model to predict

2014-10-19 Thread Nick Pomfret
Hi, I'm new to spark and just trying to make sense of the SVMWithSGD
example.

I ran my dataset through it and built a model.  When I call predict() on
the testing data (after clearThreshold()) I was expecting to get answers in
the range of 0 to 1.  But they aren't; all predictions seem to be negative
numbers between -0 and -2.  I guess my question is what do these
predictions mean?  How are they of use?

The outcome I need is a probability rather than a binary.

Here's my java code:

SparkConf conf = new SparkConf()
.setAppName("name")
.set("spark.cores.max", "1");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<LabeledPoint> points = sc.textFile(path).map(new ParsePoint()).cache();

JavaRDD<LabeledPoint> training = points.sample(false, 0.8, 0L).cache();

JavaRDD<LabeledPoint> testing = points.subtract(training);

SVMModel model = SVMWithSGD.train(training.rdd(), 100);

model.clearThreshold();

for (LabeledPoint point : testing.toArray()) {
Double score = model.predict(point.features());

System.out.println("score = " + score); // <- all these are negative numbers, seemingly between 0 and -2
}


Re: Using SVMWithSGD model to predict

2014-10-19 Thread Sean Owen
The problem is that you called clearThreshold(). The result becomes the SVM
margin not a 0/1 class prediction. There is no probability output.

There was a very similar question last week. Is there an example out there
suggesting clearThreshold()? I also wonder if it is good to overload the
meaning of the output indirectly this way.
On Oct 19, 2014 6:53 PM, "npomfret"  wrote:

> Hi, I'm new to spark and just trying to make sense of the SVMWithSGD
> example. I ran my dataset through it and build a model. When I call
> predict() on the testing data (after clearThreshold()) I was expecting to
> get answers in the range of 0 to 1. But they aren't, all predictions seem
> to be negative numbers between -0 and -2. I guess my question is what do
> these predictions mean? How are they of use? The outcome I need is a
> probability rather than a binary. Here's my java code: SparkConf conf = new
> SparkConf() .setAppName("name") .set("spark.cores.max", "1");
> JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD points =
> sc.textFile(path).map(new ParsePoint()).cache(); JavaRDD training =
> points.sample(false, 0.8, 0L).cache(); JavaRDD testing =
> points.subtract(training); SVMModel model =
> SVMWithSGD.train(training.rdd(), 100); model.clearThreshold(); for
> (LabeledPoint point : testing.toArray()) { Double score =
> model.predict(point.features()); System.out.println("score = " +
> score);//<- all these are negative numbers, seemingly between 0 and -2 }
> --
> View this message in context: Using SVMWithSGD model to predict
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Using SVMWithSGD model to predict

2014-10-19 Thread Nick Pomfret
Thanks.

The example I used is here
https://spark.apache.org/docs/latest/mllib-linear-methods.html see
SVMClassifier

So there's no way to get a probability based output?  What about from
linear regression, or logistic regression?

On 19 October 2014 19:52, Sean Owen  wrote:

> The problem is that you called clearThreshold(). The result becomes the
> SVM margin not a 0/1 class prediction. There is no probability output.
>
> There was a very similar question last week. Is there an example out there
> suggesting clearThreshold()? I also wonder if it is good to overload the
> meaning of the output indirectly this way.
> On Oct 19, 2014 6:53 PM, "npomfret"  wrote:
>
>> Hi, I'm new to spark and just trying to make sense of the SVMWithSGD
>> example. I ran my dataset through it and build a model. When I call
>> predict() on the testing data (after clearThreshold()) I was expecting to
>> get answers in the range of 0 to 1. But they aren't, all predictions seem
>> to be negative numbers between -0 and -2. I guess my question is what do
>> these predictions mean? How are they of use? The outcome I need is a
>> probability rather than a binary. Here's my java code: SparkConf conf = new
>> SparkConf() .setAppName("name") .set("spark.cores.max", "1");
>> JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD points =
>> sc.textFile(path).map(new ParsePoint()).cache(); JavaRDD training =
>> points.sample(false, 0.8, 0L).cache(); JavaRDD testing =
>> points.subtract(training); SVMModel model =
>> SVMWithSGD.train(training.rdd(), 100); model.clearThreshold(); for
>> (LabeledPoint point : testing.toArray()) { Double score =
>> model.predict(point.features()); System.out.println("score = " +
>> score);//<- all these are negative numbers, seemingly between 0 and -2 }
>> --
>> View this message in context: Using SVMWithSGD model to predict
>> 
>> Sent from the Apache Spark User List mailing list archive
>>  at Nabble.com.
>>
>


Re: Using SVMWithSGD model to predict

2014-10-19 Thread Sean Owen
Ah right. It is important to use clearThreshold() in that example in
order to generate margins, because the AUC metric needs the
classifications to be ranked by some relative strength, rather than
just 0/1. These outputs are not probabilities, and that is not what
SVMs give you in general. There are techniques for estimating
probabilities from SVM output but these aren't present here.

If you just want 0/1, you do not want to call clearThreshold().

Linear regression is not a classifier so probabilities don't enter
into it. Logistic regression however does give you a probability if
you compute the logistic function of the input directly.
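
A minimal sketch of that last point, computing a probability by hand from a
trained MLlib logistic regression model (training is omitted and the method
name is invented):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector

// P(label = 1 | features) = logistic function of the linear score w.x + intercept
def probabilityOfPositive(model: LogisticRegressionModel, features: Vector): Double = {
  val margin = model.weights.toArray.zip(features.toArray)
    .map { case (w, x) => w * x }
    .sum + model.intercept
  1.0 / (1.0 + math.exp(-margin))
}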

On Sun, Oct 19, 2014 at 3:00 PM, Nick Pomfret
 wrote:
> Thanks.
>
> The example I used is here
> https://spark.apache.org/docs/latest/mllib-linear-methods.html see
> SVMClassifier
>
> So there's no way to get a probability based output?  What about from linear
> regression, or logistic regression?
>
> On 19 October 2014 19:52, Sean Owen  wrote:
>>
>> The problem is that you called clearThreshold(). The result becomes the
>> SVM margin not a 0/1 class prediction. There is no probability output.
>>
>> There was a very similar question last week. Is there an example out there
>> suggesting clearThreshold()? I also wonder if it is good to overload the
>> meaning of the output indirectly this way.
>>
>> On Oct 19, 2014 6:53 PM, "npomfret"  wrote:
>>>
>>> Hi, I'm new to spark and just trying to make sense of the SVMWithSGD
>>> example. I ran my dataset through it and build a model. When I call
>>> predict() on the testing data (after clearThreshold()) I was expecting to
>>> get answers in the range of 0 to 1. But they aren't, all predictions seem to
>>> be negative numbers between -0 and -2. I guess my question is what do these
>>> predictions mean? How are they of use? The outcome I need is a probability
>>> rather than a binary. Here's my java code: SparkConf conf = new SparkConf()
>>> .setAppName("name") .set("spark.cores.max", "1"); JavaSparkContext sc = new
>>> JavaSparkContext(conf); JavaRDD points = sc.textFile(path).map(new
>>> ParsePoint()).cache(); JavaRDD training = points.sample(false, 0.8,
>>> 0L).cache(); JavaRDD testing = points.subtract(training); SVMModel model =
>>> SVMWithSGD.train(training.rdd(), 100); model.clearThreshold(); for
>>> (LabeledPoint point : testing.toArray()) { Double score =
>>> model.predict(point.features()); System.out.println("score = " + score);//<-
>>> all these are negative numbers, seemingly between 0 and -2 }
>>> 
>>> View this message in context: Using SVMWithSGD model to predict
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>




Re: What executes on worker and what executes on driver side

2014-10-19 Thread Saurabh Wadhawan
Any response to this?

1. How do I know which statements from the Spark script will be executed on
the worker side in a stage?
e.g. if I have
val x = 1 (or any other code)
in my driver code, will the same statement be executed on the worker side
in a stage?

2. How can I do a map-side join in Spark:
   a. without broadcast (i.e. by reading a file once in each executor; see the sketch below)
   b. with broadcast, but by broadcasting a complete RDD to each executor

Regards
- Saurabh Wadhawan
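
A minimal sketch of one common pattern for 2(a): load a lookup table at most once
per executor JVM through a lazy singleton and use it inside mapPartitions. The
file path and record format below are invented for illustration:

import scala.io.Source
import org.apache.spark.rdd.RDD

// Initialised lazily, at most once per executor JVM, the first time a task touches it.
object Lookup {
  lazy val table: Map[String, String] =
    Source.fromFile("/data/lookup.csv")            // hypothetical path; must exist on every worker
      .getLines()
      .map { line => val Array(k, v) = line.split(",", 2); k -> v }
      .toMap
}

// Map-side join: enrich each record from the lookup table without any shuffle.
def mapSideJoin(records: RDD[(String, String)]): RDD[(String, (String, Option[String]))] =
  records.mapPartitions { iter =>
    iter.map { case (k, v) => (k, (v, Lookup.table.get(k))) }
  }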



On 19-Oct-2014, at 1:54 am, Saurabh Wadhawan <saurabh.wadha...@guavus.com> wrote:

Hi,

 I have following questions:

1. When I write a Spark script, how do I know which part runs on the driver side
and which runs on the worker side?
So let's say I write code to read a plain text file.
Will it run on the driver side only, on the worker side only, or on
both sides?

2. If I want each worker to load a file, let's say for a join, and the file is
pretty huge, say in GBs, so that I don't want to broadcast it, then what's
the best way to do it?
 Another way to say the same thing: how do I load a data structure
for fast lookup (and not an RDD) on each worker node in the executor?

Regards
- Saurabh




Re: Submissions open for Spark Summit East 2015

2014-10-19 Thread Matei Zaharia
BTW several people asked about registration and student passes. Registration 
will open in a few weeks, and like in previous Spark Summits, I expect there to 
be a special pass for students.

Matei

> On Oct 18, 2014, at 9:52 PM, Matei Zaharia  wrote:
> 
> After successful events in the past two years, the Spark Summit conference 
> has expanded for 2015, offering both an event in New York on March 18-19 and 
> one in San Francisco on June 15-17. The conference is a great chance to meet 
> people from throughout the Spark community and see the latest news, tips and 
> use cases.
> 
> Submissions are now open for Spark Summit East 2015, to be held in New York 
> on March 18-19. If you’d like to give a talk on use cases, neat applications, 
> or ongoing Spark development, submit your talk online today at 
> http://prevalentdesignevents.com/sparksummit2015/east/speaker/. Submissions 
> will be open until December 6th, 2014.
> 
> If you missed this year’s Spark Summit, you can still find videos from all 
> talks online at http://spark-summit.org/2014.
> 
> Hope to see you there,
> 
> Matei





Re: Oryx + Spark mllib

2014-10-19 Thread Jayant Shekhar
Hi Deb,

Do check out https://github.com/OryxProject/oryx.

It does integrate with Spark. Sean has put quite a bit of neat detail
about the architecture on the page. It has all the things you are thinking
about :)

Thanks,
Jayant


On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das 
wrote:

> Hi,
>
> Is someone working on a project on integrating the Oryx model serving layer
> with Spark? Models will be built using either streaming data or batch data
> in HDFS and cross-validated with MLlib APIs, but the model serving layer
> will expose API endpoints like Oryx
> and read the models, maybe from HDFS/Impala/Spark SQL.
>
> One of the requirements is that the API layer should be scalable and
> elastic... as requests grow we should be able to add more nodes... using the Play
> and Akka clustering modules...
>
> If there is a ongoing project on github please point to it...
>
> Is there a plan of adding model serving and experimentation layer to mllib
> ?
>
> Thanks.
> Deb
>
>
>


Is Spark the right tool?

2014-10-19 Thread kc66
I am very new to Spark.
I am working on a project that involves reading stock transactions off a number
of TCP connections and
1. periodically (once every few hours) uploads the transaction records to
HBase
2. maintains the records that are not yet written into HBase and acts as a
HTTP query server for these records. An example for a query would be to
return all transactions between 1-2pm for Google stocks for the current
trading day.

I am thinking of using Kafka to receive all the transaction records. Spark
will be the consumers of Kafka output.

In particular, I need to create an RDD hashmap with a string (stock ticker
symbol) as the key and a list (or vector) of transaction records as the data.
This RDD needs to be "thread (or process) safe" since different threads and
processes will be reading and modifying it. I need insertion, deletion, and
lookup to be fast.
Is this something that can be done with Spark and is Spark the right tool to
use in terms of latency and throughput?

Pardon me if I don't know what I am talking about. All these are very new to
me.
Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-the-right-tool-tp16775.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Using SVMWithSGD model to predict

2014-10-19 Thread Nick Pomfret
Thanks for the info.

On 19 October 2014 20:46, Sean Owen  wrote:

> Ah right. It is important to use clearThreshold() in that example in
> order to generate margins, because the AUC metric needs the
> classifications to be ranked by some relative strength, rather than
> just 0/1. These outputs are not probabilities, and that is not what
> SVMs give you in general. There are techniques for estimating
> probabilities from SVM output but these aren't present here.
>
> If you just want 0/1, you do not want to call clearThreshold().
>
> Linear regression is not a classifier so probabilities don't enter
> into it. Logistic regression however does give you a probability if
> you compute the logistic function of the input directly.
>
> On Sun, Oct 19, 2014 at 3:00 PM, Nick Pomfret
>  wrote:
> > Thanks.
> >
> > The example I used is here
> > https://spark.apache.org/docs/latest/mllib-linear-methods.html see
> > SVMClassifier
> >
> > So there's no way to get a probability based output?  What about from
> linear
> > regression, or logistic regression?
> >
> > On 19 October 2014 19:52, Sean Owen  wrote:
> >>
> >> The problem is that you called clearThreshold(). The result becomes the
> >> SVM margin not a 0/1 class prediction. There is no probability output.
> >>
> >> There was a very similar question last week. Is there an example out
> there
> >> suggesting clearThreshold()? I also wonder if it is good to overload the
> >> meaning of the output indirectly this way.
> >>
> >> On Oct 19, 2014 6:53 PM, "npomfret" 
> wrote:
> >>>
> >>> Hi, I'm new to spark and just trying to make sense of the SVMWithSGD
> >>> example. I ran my dataset through it and build a model. When I call
> >>> predict() on the testing data (after clearThreshold()) I was expecting
> to
> >>> get answers in the range of 0 to 1. But they aren't, all predictions
> seem to
> >>> be negative numbers between -0 and -2. I guess my question is what do
> these
> >>> predictions mean? How are they of use? The outcome I need is a
> probability
> >>> rather than a binary. Here's my java code: SparkConf conf = new
> SparkConf()
> >>> .setAppName("name") .set("spark.cores.max", "1"); JavaSparkContext sc
> = new
> >>> JavaSparkContext(conf); JavaRDD points = sc.textFile(path).map(new
> >>> ParsePoint()).cache(); JavaRDD training = points.sample(false, 0.8,
> >>> 0L).cache(); JavaRDD testing = points.subtract(training); SVMModel
> model =
> >>> SVMWithSGD.train(training.rdd(), 100); model.clearThreshold(); for
> >>> (LabeledPoint point : testing.toArray()) { Double score =
> >>> model.predict(point.features()); System.out.println("score = " +
> score);//<-
> >>> all these are negative numbers, seemingly between 0 and -2 }
> >>> 
> >>> View this message in context: Using SVMWithSGD model to predict
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >
> >
>


MLlib model build and low CPU usage

2014-10-19 Thread Nick Pomfret
I'm building a model in a standalone cluster with just a single worker,
limited to 3 cores and 4 GB RAM.  The node starts up and prints the
message:

Starting Spark worker 192.168.1.185:60203 with 3 cores, 4.0 GB RAM

During model training (SVMWithSGD) the CPU usage on the worker is very low.  It
barely gets above 25% of a single core.  I've tried adjusting the number of
cores, but no matter what, the CPU usage seems to average around 25% (of just
1 core).

The machine is running OS X with a quad-core i7.

Is this expected behaviour?  It feels like the CPU is massively
under-utilised.


Spark Streaming scheduling control

2014-10-19 Thread davidkl
Hello,

I have a cluster with 1 master and 2 slaves running 1.1.0. I am having
problems getting both slaves to work at the same time. When I launch the
driver on the master, one of the slaves is assigned the receiver task, and
initially both slaves start processing tasks. After a few tens of batches,
the slave running the receiver starts processing all the tasks, and the other
won't execute any more tasks. If I cancel the execution and start over, the
roles may switch if the other slave gets assigned the receiver, but
the behaviour is the same, and the other slave will stop processing tasks
after a short while. So both slaves are working, essentially, but never at
the same time in a consistent way. No errors in the logs, etc.

I have tried increasing partitions (up to 100, while slaves have 4 cores
each) but no success :-/

I understand that Spark may decide not to distribute tasks to all workers
due to data locality, etc. but in this case I think there is something else,
since one slave cannot keep up with the processing rate and the total delay
keeps growing: I have set up the batch interval to 1s, but each batch is
processed in 1.6s so after some time the delay (and the enqueued data) is
just too much. 

Does Spark take into consideration this time restriction on the scheduling?
I mean total processing time <= batch duration. Any configuration affecting
that? 

Am I missing something important? Any hints or things to test?

Thanks in advance! ;-)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-scheduling-control-tp16778.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread Pat Ferrel
Trying to upgrade from Spark 1.0.1 to 1.1.0. Can’t imagine the upgrade is the 
problem but anyway...

I get a NoClassDefFoundError for RandomGenerator when running a driver from the 
CLI, but only when using a named master, even a standalone master. If I run 
using master = local[4] the job executes correctly, but if I set the master to 
spark://Maclaurin.local:7077 (though it is the same machine) I get the 
NoClassDefFoundError. The classpath seems correct on the CLI and the jars do 
indeed contain the offending class (see below). There must be some difference 
in how classes are loaded between local[4] and spark://Maclaurin.local:7077?

Any ideas?
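
One possibility worth checking: with a named master the application jar must be
shipped to the executors explicitly, e.g. via spark-submit --jars or
SparkConf.setJars. A sketch (the app name is an assumption; the jar path is the
one from below):

val conf = new org.apache.spark.SparkConf()
  .setAppName("mahout-driver")                                        // assumed name
  .setMaster("spark://Maclaurin.local:7077")
  .setJars(Seq("target/mahout-spark_2.10-1.0-SNAPSHOT-job.jar"))      // ship the job jar to the executors
val sc = new org.apache.spark.SparkContext(conf)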

===

The driver is in mahout-spark_2.10-1.0-SNAPSHOT-job.jar, so the fact that it executes means 
the jar must be on the classpath. When I look at what's in the jar I see 
RandomGenerator.

Maclaurin:target pat$ jar tf mahout-spark_2.10-1.0-SNAPSHOT-job.jar | grep 
RandomGenerator
cern/jet/random/engine/RandomGenerator.class
org/apache/commons/math3/random/GaussianRandomGenerator.class
org/apache/commons/math3/random/JDKRandomGenerator.class
org/apache/commons/math3/random/UniformRandomGenerator.class
org/apache/commons/math3/random/RandomGenerator.class  <==!
org/apache/commons/math3/random/NormalizedRandomGenerator.class
org/apache/commons/math3/random/AbstractRandomGenerator.class
org/apache/commons/math3/random/StableRandomGenerator.class

But I get the following error when executing the job:

14/10/19 15:39:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 6.9 
(TID 84, 192.168.0.2): java.lang.NoClassDefFoundError: 
org/apache/commons/math3/random/RandomGenerator
org.apache.mahout.common.RandomUtils.getRandom(RandomUtils.java:65)

org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$5.apply(SimilarityAnalysis.scala:272)

org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$5.apply(SimilarityAnalysis.scala:267)

org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)

org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
java.lang.Thread.run(Thread.java:695)






RE: how to build spark 1.1.0 to include org.apache.commons.math3 ?

2014-10-19 Thread Henry Hung
@Sean Owen,

Thank you for the information.

I changed the pom file to include math3, because I needed the math3 library from 
my previous use with 1.0.2.

Best regards,
Henry
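
For reference, if only the application (not Spark itself) needs commons-math3 at
runtime, an alternative to patching Spark's pom is declaring the dependency in the
application's own build; a minimal sbt sketch (the version number is an assumption):

// build.sbt (fragment)
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"

The library then gets bundled into the application's assembly jar or passed along
with spark-submit --jars.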

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Saturday, October 18, 2014 2:19 AM
To: MA33 YTHung1
Cc: user@spark.apache.org
Subject: Re: how to build spark 1.1.0 to include org.apache.commons.math3 ?

It doesn't contain commons math3 since Spark does not depend on it.
Its tests do, but tests are not built into the Spark assembly.

On Thu, Oct 16, 2014 at 9:57 PM, Henry Hung  wrote:
> HI All,
>
>
>
> I try to build spark 1.1.0 using sbt with command:
>
> sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly
>
>
>
> > but the resulting spark-assembly-1.1.0-hadoop2.2.0.jar is still missing
> > the Apache Commons Math3 classes.
>
>
>
> How to add the math3 into package?
>
>
>
> Best regards,
>
> Henry
>
>



RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Cheng, Hao
Seems like a bug in JavaSQLContext.getSchema(), which doesn't enumerate all of 
the data types supported by Catalyst.
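
Until that is fixed, one possible workaround is to build the schema programmatically
from the Scala API instead of the Java bean reflection path; a rough sketch against
the 1.1 API, assuming an existing SparkContext named sc and keeping the parsing minimal:

import java.sql.Timestamp
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Explicit schema that includes the TimestampType column.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("time", TimestampType, nullable = true)))

// Build Rows directly from the "name,yyyy-MM-dd" strings used in the test.
val rowRDD = sc.parallelize(Seq("1,2014-01-01", "2,2014-02-01")).map { line =>
  val Array(name, day) = line.split(",")
  Row(name, Timestamp.valueOf(day + " 00:00:00"))
}

val events = sqlContext.applySchema(rowRDD, schema)
events.registerTempTable("event")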

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 11:44 PM
To: Wang, Daoyuan; user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at 
com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown 
Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Sunday, October 19, 2014 10:31 AM
To: Ge, Yao (Y.); user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with Timestamp field. 
Anything wrong with my code below?

public static class Event implements Serializable {
private String name;
private Timestamp time;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Timestamp getTime() {
return time;
}
   

All executors run on just a few nodes

2014-10-19 Thread Tao Xiao
Hi all,

I have a Spark-0.9 cluster, which has 16 nodes.

I wrote a Spark application to read data from an HBase table, which has 86
regions spreading over 20 RegionServers.

I submitted the Spark app in Spark standalone mode and found that there
were 86 executors running on just 3 nodes, and it took about 30 minutes to
read the data from the table. In this case, I noticed from the Spark master UI
that the Locality Level of all executors was "PROCESS_LOCAL".

Later I ran the same app again (without any code changes) and found that
those 86 executors were running on 16 nodes, and this time it took just 4
minutes to read data from the same HBase table. In this case, I noticed
that the Locality Level of most executors was "NODE_LOCAL".

After testing multiple times, I found the two cases above occur randomly.

So I have 2 questions:
1)  Why would the two cases above occur randomly when I submitted the same
application multiple times ?
2)  Would the spread of executors influence locality level ?

Thank you.


Re: All executors run on just a few nodes

2014-10-19 Thread raymond
My best guess is that the speed at which your executors get registered with the driver 
differs between runs.

When you run it the first time, the executors are not all registered yet when the 
task set manager starts to assign tasks, so the tasks are assigned to the executors 
that are already available and can satisfy what you need, say 86 tasks in one batch.

And "PROCESS_LOCAL" does not necessarily mean that the data is local; it could be 
that the executor holding the data is not available yet (in your case it might become 
available later).

If this is the case, you could just sleep a few seconds before running the job. There 
are also some related patches that provide other ways to sync executor status before 
running applications, but I haven't tracked their status for a while.


Raymond
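
A tiny sketch of the sleep workaround (the app name and delay are arbitrary); newer
Spark versions also add configuration for waiting until a minimum fraction of
executors has registered, which 0.9 may not have:

val conf = new org.apache.spark.SparkConf().setAppName("my-app")  // hypothetical app name
val sc = new org.apache.spark.SparkContext(conf)
Thread.sleep(10000)   // crude: give executors ~10 s to register before submitting the first job
// ... build RDDs and run jobs from here ...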

On 20 Oct 2014, at 11:22 AM, Tao Xiao  wrote:

> Hi all, 
> 
> I have a Spark-0.9 cluster, which has 16 nodes.
> 
> I wrote a Spark application to read data from an HBase table, which has 86 
> regions spreading over 20 RegionServers.
> 
> I submitted the Spark app in Spark standalone mode and found that there were 
> 86 executors running on just 3 nodes and it took about  30 minutes to read 
> data from the table. In this case, I noticed from Spark master UI that 
> Locality Level of all executors are "PROCESS_LOCAL". 
> 
> Later I ran the same app again (without any code changed) and found that 
> those 86 executors were running on 16 nodes, and this time it took just 4 
> minutes to read date from the same HBase table. In this case, I noticed that 
> Locality Level of most executors are "NODE_LOCAL". 
> 
> After testing multiple times, I found the two cases above occur randomly. 
> 
> So I have 2 questions: 
> 1)  Why would the two cases above occur randomly when I submitted the same 
> application multiple times ?
> 2)  Would the spread of executors influence locality level ?
> 
> Thank you.
> 
>   



default parallelism bug?

2014-10-19 Thread Kevin Jung
Hi,
I usually use a file on HDFS to make a PairRDD and analyze it using
combineByKey, reduceByKey, etc.
But sometimes it hangs when I set the spark.default.parallelism configuration,
even though the file is small.
If I remove this configuration, everything works fine.
Can anyone tell me why this occurs?

Regards,
Kevin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/default-parallelism-bug-tp16787.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
I have created an issue for this 
https://issues.apache.org/jira/browse/SPARK-4003


From: Cheng, Hao
Sent: Monday, October 20, 2014 9:20 AM
To: Ge, Yao (Y.); Wang, Daoyuan; user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Seems bugs in the JavaSQLContext.getSchema(), which doesn't enumerate all of 
the data types supported by Catalyst.

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 11:44 PM
To: Wang, Daoyuan; user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at 
com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown 
Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Sunday, October 19, 2014 10:31 AM
To: Ge, Yao (Y.); user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with Timestamp field. 
Anything wrong with my code below?

public static class Event implements Serializable {
private String name;
private Timestamp time;
public String getName() {
return name;
}
public void

Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to HDFS and then get one file locally by using "hdfs dfs -getmerge ..."
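
A rough sketch of that approach (the paths are made up):

import org.apache.spark.rdd.RDD

// Write the RDD to HDFS, optionally as a single part file, then merge it into one local file.
def saveForMerge(rdd: RDD[String]): Unit = {
  rdd.coalesce(1).saveAsTextFile("hdfs:///tmp/output")   // coalesce(1) gives a single part file, at the cost of parallelism
  // afterwards, from a shell:
  //   hdfs dfs -getmerge /tmp/output /local/path/result.txt
}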

On Friday, October 17, 2014, Sean Owen  wrote:

> You can save to a local file. What are you trying and what doesn't work?
>
> You can output one file by repartitioning to 1 partition but this is
> probably not a good idea as you are bottlenecking the output and some
> upstream computation by disabling parallelism.
>
> How about just combining the files on HDFS afterwards? or just reading
> all the files instead of 1? You can hdfs dfs -cat a bunch of files at
> once.
>
> On Fri, Oct 17, 2014 at 6:46 PM, Parthus  > wrote:
> > Hi,
> >
> > I have a spark mapreduce task which requires me to write the final rdd
> to an
> > existing local file (appending to this file). I tried two ways but
> neither
> > works well:
> >
> > 1. use saveAsTextFile() api. Spark 1.1.0 claims that this API can write
> to
> > local, but I never make it work. Moreover, the result is not one file
> but a
> > series of part-x files which is not what I hope to get.
> >
> > 2. collect the rdd to an array and write it to the driver node using
> Java's
> > File IO. There are also two problems: 1) my RDD is huge(1TB), which
> cannot
> > fit into the memory of one driver node. I have to split the task into
> small
> > pieces and collect them part by part and write; 2) During the writing by
> > Java IO, the Spark Mapreduce task has to wait, which is not efficient.
> >
> > Could anybody provide me an efficient way to solve this problem? I wish
> that
> > the solution could be like: appending a huge rdd to a local file without
> > pausing the MapReduce during writing?
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > For additional commands, e-mail: user-h...@spark.apache.org
> 
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>

-- 
- Rishi


Re: All executors run on just a few nodes

2014-10-19 Thread Tao Xiao
Raymond,

Thank you.

But I read in another thread that "PROCESS_LOCAL" means the data is in the
same JVM as the code that is running. When the data is in the same JVM as the
code, it should be on the same node as that JVM, i.e., the data can be said
to be local.

Also, you said that tasks will be assigned to available executors which
satisfy the application's requirements. But what requirements must an
executor satisfy for a task to be assigned to it? Do you mean resources
(memory, CPU)?

Finally, is there any way to guarantee that all executors for an
application will run on all Spark nodes when the data to be processed is big
enough (for example, when the HBase data resides on all RegionServers)?




2014-10-20 11:35 GMT+08:00 raymond :

> My best guess is the speed that your executors got registered with driver
> differs between each run.
>
> when you run it for the first time, the executors is not fully registered
> when task set manager start to assign tasks, and thus the tasks was
> assigned to available executors which have already satisfy what you need
> ,say 86 with one batch.
>
> And the “Process_local” does not necessary means that the data is local,
> it could be that the executor is not available yet for the data source ( in
> your case, might though will be available later).
>
> If this is the case, you could just sleep a few seconds before run the
> job. or there are some patches related and providing other way to sync
> executors status before running applications, but I haven’t track the
> related status for a while.
>
>
> Raymond
>
> On 20 Oct 2014, at 11:22 AM, Tao Xiao  wrote:
>
> Hi all,
>
> I have a Spark-0.9 cluster, which has 16 nodes.
>
> I wrote a Spark application to read data from an HBase table, which has 86
> regions spread over 20 RegionServers.
>
> I submitted the Spark app in Spark standalone mode and found that there
> were 86 executors running on just 3 nodes, and it took about 30 minutes to
> read data from the table. In this case, I noticed from the Spark master UI
> that the Locality Level of all executors was "PROCESS_LOCAL".
>
> Later I ran the same app again (without any code changes) and found that
> those 86 executors were running on 16 nodes, and this time it took just 4
> minutes to read data from the same HBase table. In this case, I noticed
> that the Locality Level of most executors was "NODE_LOCAL".
>
> After testing multiple times, I found the two cases above occur randomly.
>
> So I have 2 questions:
> 1)  Why would the two cases above occur randomly when I submitted the same
> application multiple times ?
> 2)  Would the spread of executors influence locality level ?
>
> Thank you.
>
>
>
>
>


Re: All executors run on just a few nodes

2014-10-19 Thread raymond
When the data’s source host is not one of the registered executors, the task
will also be marked as PROCESS_LOCAL, though it should really have a different
name for this case. I don’t know whether someone changed this name recently,
but for 0.9 that is how it works.

When I say satisfy, yes, I mean the executors have enough resources to run
your tasks. In your case all the tasks were assigned to the 4 registered
nodes, even though the other 16 executors probably registered a short while
later; by the time they registered, all the tasks had already been assigned.
I am not sure this is exactly what happened in your cluster, but it seems
likely to me.

Your case might not be that the executors do not run on all Spark nodes, but
that they do not get registered quickly enough.
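To make the "sleep a few seconds before running the job" idea concrete, here is a minimal Scala sketch (not from the thread) that polls SparkContext.getExecutorMemoryStatus, which has one entry per registered block manager including the driver's, until the expected number of executors is up; the executor count and timeout are made-up values:

import org.apache.spark.{SparkConf, SparkContext}

object WaitForExecutorsSketch {
  // Block until `expected` executors have registered, or give up after `timeoutMs`.
  def waitForExecutors(sc: SparkContext, expected: Int, timeoutMs: Long = 30000L): Unit = {
    val deadline = System.currentTimeMillis() + timeoutMs
    // getExecutorMemoryStatus includes the driver's block manager, hence "+ 1".
    while (sc.getExecutorMemoryStatus.size < expected + 1 &&
           System.currentTimeMillis() < deadline) {
      Thread.sleep(500)
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-scan"))
    waitForExecutors(sc, expected = 16)   // hypothetical: one executor per node
    // ... build the HBase RDD and run the job only after this returns ...
    sc.stop()
  }
}

Newer Spark releases expose configuration for the same purpose (spark.scheduler.minRegisteredResourcesRatio and spark.scheduler.maxRegisteredResourcesWaitingTime), which is likely the patch work referred to above, but to my knowledge those settings are not available in 0.9.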



On Oct 20, 2014, at 2:15 PM, Tao Xiao  wrote:

> Raymond,
> 
> Thank you.
> 
> But I read from another thread that "PROCESS_LOCAL" means the data is in
> the same JVM as the code that is running. When data is in the same JVM as
> the running code, the data should be on the same node as that JVM, i.e.,
> the data can be said to be local.
>
> Also, you said that the tasks will be assigned to available executors which
> satisfy the application's requirements. But what requirements must an
> executor satisfy so that a task can be assigned to it? Do you mean
> resources (memory, CPU)?
>
> Finally, is there any way to guarantee that the executors for an
> application will run on all Spark nodes when the data to be processed is
> big enough (for example, HBase data residing on all RegionServers)?
>  
> 
> 
> 
> 2014-10-20 11:35 GMT+08:00 raymond :
> My best guess is that the speed at which your executors get registered
> with the driver differs between runs.
>
> When you run it for the first time, the executors are not all registered
> when the task set manager starts to assign tasks, so the tasks are
> assigned to the executors that are already available and satisfy what you
> need, say 86 tasks in one batch.
>
> And “PROCESS_LOCAL” does not necessarily mean that the data is local; it
> can also mean that no executor is available yet on the data’s source host
> (in your case, one might become available later).
>
> If this is the case, you could just sleep a few seconds before running the
> job, or there are some related patches that provide other ways to sync
> executor status before running applications, but I haven’t tracked their
> status for a while.
> 
> 
> Raymond
> 
> On Oct 20, 2014, at 11:22 AM, Tao Xiao  wrote:
> 
>> Hi all, 
>> 
>> I have a Spark-0.9 cluster, which has 16 nodes.
>> 
>> I wrote a Spark application to read data from an HBase table, which has 86
>> regions spread over 20 RegionServers.
>>
>> I submitted the Spark app in Spark standalone mode and found that there
>> were 86 executors running on just 3 nodes, and it took about 30 minutes to
>> read data from the table. In this case, I noticed from the Spark master UI
>> that the Locality Level of all executors was "PROCESS_LOCAL".
>>
>> Later I ran the same app again (without any code changes) and found that
>> those 86 executors were running on 16 nodes, and this time it took just 4
>> minutes to read data from the same HBase table. In this case, I noticed
>> that the Locality Level of most executors was "NODE_LOCAL".
>> 
>> After testing multiple times, I found the two cases above occur randomly. 
>> 
>> So I have 2 questions: 
>> 1)  Why would the two cases above occur randomly when I submitted the same 
>> application multiple times ?
>> 2)  Would the spread of executors influence locality level ?
>> 
>> Thank you.
>> 
>>   
> 
> 



Re: checkpoint and not running out of disk space

2014-10-19 Thread sivarani
I am new to Spark, and I am using Spark Streaming with Kafka.

My batch duration is 1 second.

Assume I get 100 records in second 1, 120 records in second 2 and 80 records
in second 3:

--> {sec 1: 1,2,...100} --> {sec 2: 1,2,...120} --> {sec 3: 1,2,...80}

I apply my logic in second 1 and get a result => result1.

I want to use result1 in second 2 and have a combined result of both result1
and the 120 records of second 2 => result2.

I tried to cache the result, but I am not able to get the cached result1 in
second 2. Is this possible, or can someone shed some light on how to achieve
my goal here?

JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(jssc, String.class, String.class,
        StringDecoder.class, StringDecoder.class, kafkaParams, topicMap,
        StorageLevel.MEMORY_AND_DISK_SER_2());

I process the messages and compute words, which is the result for second 1 ...

if (resultCp != null) {
    resultCp.print();
    result = resultCp.union(words.mapValues(new Sum()));
} else {
    result = words.mapValues(new Sum());
}

resultCp = result.cache();
In second 2, resultCp should not be null, but it returns a null value, so at
any given time I have only that particular second's data. I want to find the
cumulative result. Does anyone know how to do it?

I learnt that once Spark Streaming is started with jssc.start(), control is
no longer at our end; it lies with Spark. So is it possible to pass the
result of second 1 into second 2 to find the accumulated value?

Any help is much appreciated. Thanks in advance.
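The code above uses the Java API; what follows is a minimal Scala sketch (not the poster's code) of the usual way to carry a running result across batches, updateStateByKey, which keeps per-key state for you instead of unioning with a hand-cached RDD from the previous batch. It requires a checkpoint directory, and the ZooKeeper quorum, consumer group, topic map and paths below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

object RunningCountSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("running-count"), Seconds(1))
    ssc.checkpoint("/tmp/spark-checkpoint")   // required by updateStateByKey

    // Placeholder Kafka source; createStream returns (key, message) pairs.
    val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
      .map(_._2)

    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1L))

    // Fold each batch's new values for a key into the state carried over from
    // all previous batches; returning Some(...) keeps the key's state alive.
    val running = counts.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
      Some(newValues.sum + state.getOrElse(0L))
    }

    running.print()   // at second N this shows the cumulative result of seconds 1..N
    ssc.start()
    ssc.awaitTermination()
  }
}

The same operation exists in the Java API as JavaPairDStream.updateStateByKey.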



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/checkpoint-and-not-running-out-of-disk-space-tp1525p16790.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread Dmitriy Lyubimov
The Mahout context does not include _all_ possible transitive dependencies.
It would not be lightning fast if it pulled in all the legacy etc. dependencies.

There's an "ignored" unit test that asserts context path correctness. You can
un-ignore it and run it to verify it still works as expected. The reason it is
set to "ignored" is that it requires a Mahout environment plus an already-built
Mahout in order to run successfully. I can probably look it up if you don't
find it immediately.


Now, the Mahout context only includes what's really used in the DRM algebra,
which is just a handful of jars. Apache Commons Math is not one of them.

But your driver can add it when creating the Mahout context, by additionally
tinkering with the method parameters there (such as the Spark config).
However, you may encounter the problem that the Mahout assembly currently may
not build -- and copy -- the commons-math jar into any part of the Mahout
tree.
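For what it's worth, a sketch of the "tinker with the spark config" route, using only plain Spark APIs rather than any Mahout-specific helper (whether the Mahout context-creation method accepts a custom SparkConf in your version should be checked); the app name and jar paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object ContextWithCommonsMathSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("item-similarity-driver")              // placeholder
      .setMaster("spark://Maclaurin.local:7077")
      .setJars(Seq(
        "target/mahout-spark_2.10-1.0-SNAPSHOT-job.jar", // the job jar
        "/path/to/commons-math3-3.2.jar"))               // placeholder path to the missing jar

    val sc = new SparkContext(conf)
    // ... create the Mahout distributed context from this conf/sc and run the driver ...
    sc.stop()
  }
}

setJars ships the listed jars to the executors, which addresses the common case where a class is on the driver's classpath but never distributed to the executors.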

Finally, I am against adding commons-math by default, as the general algebra
does not depend on it. I'd suggest, in order of preference, (1) getting rid
of the reliance on the commons-math random generator (surely by now we should
be OK with scala.Random or even the standard random?), or (2) adding the
dependency in a custom way per the above.

If there's an extremely compelling reason why the commons-math random
generator dependency cannot be eliminated, then a better way is to include
commons-math in the assembly (I think right now the only assembly that really
copies in dependencies is the examples one, which is probably wrong, as the
examples are not the core product here), and add it explicitly to the
createMahoutContext (or whatever that method's name was) code.

My understanding is that the random from utils was mainly encouraged because
it is automatically made deterministic in tests. I am unaware of any
fundamental deficiency of scala.Random w.r.t. its uses in existing methods.
So perhaps the Scala side needs its own "RandomUtils" for testing that does
not rely on commons-math.
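As a rough illustration of that last point, assuming reproducible seeding is all the tests actually need, a commons-math-free helper on the Scala side could look something like the following; the object name and the system property are made up:

import scala.util.Random

// Hypothetical commons-math-free stand-in for RandomUtils on the Scala side.
object ScalaRandomUtils {
  // A fixed seed (e.g. set by the test harness) makes sequences deterministic;
  // otherwise fall back to a normally seeded Random.
  def getRandom(): Random =
    sys.props.get("mahout.test.seed") match {
      case Some(seed) => new Random(seed.toLong)
      case None       => new Random()
    }
}

// usage: val r = ScalaRandomUtils.getRandom(); val g = r.nextGaussian()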


On Sun, Oct 19, 2014 at 4:36 PM, Pat Ferrel  wrote:

> Trying to upgrade from Spark 1.0.1 to 1.1.0. Can’t imagine the upgrade is
> the problem but anyway...
>
> I get a NoClassDefFoundError for RandomGenerator when running a driver
> from the CLI. But only when using a named master, even a standalone master.
> If I run using master = local[4], the job executes correctly, but if I set
> the master to spark://Maclaurin.local:7077 (though they are the same
> machine) I get the NoClassDefFoundError. The classpath seems correct on the
> CLI, and the jars do indeed contain the offending class (see below). There
> must be some difference in how classes are loaded between local[4] and
> spark://Maclaurin.local:7077?
>
> Any ideas?
>
> ===
>
> The driver is in mahout-spark_2.10-1.0-SNAPSHOT-job.jar, so the fact that it
> executes means the jar must be on the classpath. When I look at what’s in
> the jar I see RandomGenerator.
>
> Maclaurin:target pat$ jar tf mahout-spark_2.10-1.0-SNAPSHOT-job.jar | grep
> RandomGenerator
> cern/jet/random/engine/RandomGenerator.class
> org/apache/commons/math3/random/GaussianRandomGenerator.class
> org/apache/commons/math3/random/JDKRandomGenerator.class
> org/apache/commons/math3/random/UniformRandomGenerator.class
> org/apache/commons/math3/random/RandomGenerator.class  <==!
> org/apache/commons/math3/random/NormalizedRandomGenerator.class
> org/apache/commons/math3/random/AbstractRandomGenerator.class
> org/apache/commons/math3/random/StableRandomGenerator.class
>
> But get the following error executing the job:
>
> 14/10/19 15:39:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 6.9 (TID 84, 192.168.0.2): java.lang.NoClassDefFoundError:
> org/apache/commons/math3/random/RandomGenerator
> org.apache.mahout.common.RandomUtils.getRandom(RandomUtils.java:65)
>
> org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$5.apply(SimilarityAnalysis.scala:272)
>
> org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$5.apply(SimilarityAnalysis.scala:267)
>
> org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
>
> org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
>
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apac