Re: Read parquet folders recursively

2015-03-11 Thread Akhil Das
Hi

We have a custom build that reads directories recursively. Currently we use it
with fileStream like:

val lines = ssc.fileStream[LongWritable, Text,
  TextInputFormat]("/datadumps/",
  (t: Path) => true, true, true)


Making the 4th argument true enables reading recursively.


You could give it a try
https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz
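If the parquet data is static (no streaming needed), another option that may work is
pointing SQLContext at the nested files directly. This is only a hedged sketch: the
wildcard pattern, the directory depth, and whether parquetFile accepts globs in your
Spark version are assumptions to verify.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// assumes a fixed, known nesting depth such as /datadumps/<day>/<file>.parq
val parquetData = sqlContext.parquetFile("/datadumps/*/*.parq")
parquetData.registerTempTable("dumps")
sqlContext.sql("SELECT COUNT(*) FROM dumps").collect()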

Thanks
Best Regards

On Wed, Mar 11, 2015 at 9:45 PM, Masf  wrote:

> Hi all
>
> Is it possible to read recursively folders to read parquet files?
>
>
> Thanks.
>
> --
>
>
> Saludos.
> Miguel Ángel
>


Re: StreamingListener

2015-03-11 Thread Akhil Das
At the end of foreachRDD, I believe.
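For reference, a minimal sketch of hooking up the listener (standard Spark Streaming
API; assumes ssc is your StreamingContext, and the println is just an illustration):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    // fires once the whole batch, including the foreachRDD output operation, has finished
    println("Batch done: " + batchCompleted.batchInfo.batchTime)
  }
})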

Thanks
Best Regards

On Thu, Mar 12, 2015 at 6:48 AM, Corey Nolet  wrote:

> Given the following scenario:
>
> dstream.map(...).filter(...).window(...).foreachrdd()
>
> When would the onBatchCompleted fire?
>
>


Re: bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available

2015-03-11 Thread Akhil Das
Spark 1.3.0 is not officially out yet, so I don't think SBT will download
the Hadoop dependencies for your Spark build by itself. You could try manually
adding the Hadoop dependencies yourself (hadoop-core, hadoop-common,
hadoop-client).
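For example, something like this in build.sbt (a sketch only -- the version must
match the Hadoop build on your cluster; 2.4.0 here is an assumption):

// hadoop-client pulls in hadoop-common and the HDFS/MapReduce client jars transitively
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"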

Thanks
Best Regards

On Wed, Mar 11, 2015 at 9:07 PM, Patcharee Thongtra <
patcharee.thong...@uni.no> wrote:

> Hi,
>
> I have built spark version 1.3 and tried to use this in my spark scala
> application. When I tried to compile and build the application with SBT, I
> got this error:
> bad symbolic reference. A signature in SparkContext.class refers to term
> conf in value org.apache.hadoop which is not available
>
> It seems the Hadoop library is missing, but it should be pulled in
> automatically by SBT, shouldn't it?
>
> This application builds fine on Spark version 1.2.
>
> Here is my build.sbt
>
> name := "wind25t-v013"
> version := "0.1"
> scalaVersion := "2.10.4"
> unmanagedBase := baseDirectory.value / "lib"
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0"
> libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.3.0"
>
> What should I do to fix it?
>
> BR,
> Patcharee
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: hbase sql query

2015-03-11 Thread Akhil Das
Like this?

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result]).cache()


Here's a complete example.
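To actually cache the table and query it with SQL, one hedged approach (shown in
Scala for brevity; the same flow exists via the Java API) is to map each Result into
a case class and register it as a temp table. The case class, the column family "cf",
and the qualifier "col1" below are placeholders:

import org.apache.hadoop.hbase.util.Bytes

case class HRow(rowkey: String, col1: String)

val rows = hBaseRDD.map { case (key, result) =>
  HRow(Bytes.toString(key.get()),
       Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"))))
}

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD
rows.registerTempTable("hbase_table")
sqlContext.sql("SELECT rowkey, col1 FROM hbase_table LIMIT 10").collect()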

Thanks
Best Regards

On Wed, Mar 11, 2015 at 4:46 PM, Udbhav Agarwal 
wrote:

>  Hi,
>
> How can we simply cache an HBase table and run SQL queries on it via the Java API in Spark?
>
>
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>


Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-11 Thread raghav0110.cs


 Thank you very much for your detailed response, it was very informative and 
cleared up some of my misconceptions. After your explanation, I understand that 
the distribution of the data and parallelism is all meant to be an abstraction 
to the developer. 


In your response you say “When you call reduce and similar methods, each 
partition can be reduced in parallel. Then the results of that can be 
transferred across the network and reduced to the final result”. By similar 
methods do you mean all actions within spark? Does transfer of data from worker 
nodes to driver nodes happen only when an action is performed?


I am assuming that in Spark, you typically have a set of transformations
followed by some sort of action. The RDD is partitioned and sent to different
worker nodes (assuming this is a cluster setup), the transformations are applied
to the RDD partitions at the various worker nodes, and then when an action is
performed, the action runs on the worker nodes, the partial results are
aggregated at the driver, and a final reduction at the driver produces the
overall result. I would also assume that deciding whether the action should be
done on a worker node depends on the type of action. For example, performing
reduce at the worker node makes sense, while it doesn't make sense to save the
file at the worker node. Does that sound correct, or am I misinterpreting
something?






Thanks,

Raghav





From: Daniel Siegmann
Sent: Thursday, March 5, 2015 2:01 PM
To: Raghav Shankar
Cc: user@spark.apache.org








An RDD is a Resilient Distributed Dataset. The partitioning and distribution
of the data happens in the background. You'll occasionally need to concern 
yourself with it (especially to get good performance), but from an API 
perspective it's mostly invisible (some methods do allow you to specify a 
number of partitions).


When you call sc.textFile(myPath) or similar, you get an RDD. That RDD will be 
composed of a bunch of partitions, but you don't really need to worry about 
that. The partitioning will be based on how the data is stored. When you call a 
method that causes a shuffle (such as reduce), the data is repartitioned into a 
number of partitions based on your default parallelism setting (which IIRC is 
based on your number of cores if you haven't set it explicitly).

When you call reduce and similar methods, each partition can be reduced in 
parallel. Then the results of that can be transferred across the network and 
reduced to the final result. You supply the function and Spark handles the 
parallel execution of that function.
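A tiny illustration of that flow (nothing beyond the standard API; the numbers are
arbitrary):

val rdd = sc.parallelize(1L to 1000000L, 8)   // 8 partitions
// each partition computes a partial sum in parallel; the partial sums are then
// combined into the final result that is returned to the driver
val total = rdd.reduce(_ + _)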

I hope this helps clear up your misconceptions. You might also want to 
familiarize yourself with the collections API in Java 8 (or Scala, or Python, 
or pretty much any other language with lambda expressions), since RDDs are 
meant to have an API that feels similar.



On Thu, Mar 5, 2015 at 9:45 AM, raggy  wrote:

I am trying to use Apache spark to load up a file, and distribute the file to
several nodes in my cluster and then aggregate the results and obtain them.
I don't quite understand how to do this.

From my understanding the reduce action enables Spark to combine the results
from different nodes and aggregate them together. Am I understanding this
correctly?

From a programming perspective, I don't understand how I would code this
reduce function.

How exactly do I partition the main dataset into N pieces and ask them to be
parallel processed by using a list of transformations?

reduce is supposed to take in two elements and a function for combining
them. Are these 2 elements supposed to be RDDs from the context of Spark or
can they be any type of element? Also, if you have N different partitions
running parallel, how would reduce aggregate all their results into one
final result(since the reduce function aggregates only 2 elements)?

Also, I don't understand this example. The example from the spark website
uses reduce, but I don't see the data being processed in parallel. So, what
is the point of the reduce? If I could get a detailed explanation of the
loop in this example, I think that would clear up most of my questions.

class ComputeGradient extends Function {
  private Vector w;
  ComputeGradient(Vector w) { this.w = w; }
  public Vector call(DataPoint p) {
return p.x.times(p.y * (1 / (1 + Math.exp(w.dot(p.x))) - 1));
  }
}

JavaRDD points = spark.textFile(...).map(new
ParsePoint()).cache();
Vector w = Vector.random(D); // current separating plane
for (int i = 0; i < ITERATIONS; i++) {
  Vector gradient = points.map(new ComputeGradient(w)).reduce(new
AddVectors());
  w = w.subtract(gradient);
}
System.out.println("Final separating plane: " + w);

Also, I have been trying to find the source code for reduce from the Apache
Spark Github, but the source is pretty huge and I haven't been able to
pinpoint it. Could someone please direct me towards which file I could find
it in?




Re: sc.textFile() on windows cannot access UNC path

2015-03-11 Thread Akhil Das
That sounds like the way to do it. Could you try accessing a file from the UNC
path with native Java NIO code and make sure it is accessible with the
URI file:10.196.119.230/folder1/abc.txt?
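For instance, a quick check from a Scala shell using Java 7 NIO (the host/share is
just the one from this thread, and the exact backslash escaping is an assumption):

import java.nio.file.{Files, Paths}

val p = Paths.get("""\\10.196.119.230\folder1\abc.txt""")
println(Files.exists(p))               // can the JVM see the UNC path at all?
println(Files.readAllBytes(p).length)  // and actually read it?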

Thanks
Best Regards

On Wed, Mar 11, 2015 at 7:45 PM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>  Thanks for the reference. Is the following procedure correct?
>
>
>
> 1. Copy the Hadoop source code
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat.java as my own
> class, e.g. UncTextInputFormat.java
>
> 2. Modify UncTextInputFormat.java to handle the UNC path
>
> 3. Call sc.newAPIHadoopFile(…) with
>
>
>
> sc.newAPIHadoopFile[LongWritable, Text, UncTextInputFormat](“file:
> 10.196.119.230/folder1/abc.txt”,
>
>  classOf[UncTextInputFormat],
>
>  classOf[LongWritable],
>
> classOf[Text], conf)
>
>
>
> Ningjun
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* Wednesday, March 11, 2015 2:40 AM
> *To:* Wang, Ningjun (LNG-NPV)
> *Cc:* java8964; user@spark.apache.org
>
> *Subject:* Re: sc.textFile() on windows cannot access UNC path
>
>
>
> ​​
>
> I don't have a complete example for your usecase, but you can see a lot of
> codes showing how to use new APIHadoopFile from here
> 
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Tue, Mar 10, 2015 at 7:37 PM, Wang, Ningjun (LNG-NPV) <
> ningjun.w...@lexisnexis.com> wrote:
>
> This sounds like the right approach. Is there any sample code showing how
> to use sc.newAPIHadoopFile  ? I am new to Spark and don’t know much about
> Hadoop. I just want to read a text file from UNC path into an RDD.
>
>
>
> Thanks
>
>
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* Tuesday, March 10, 2015 9:14 AM
> *To:* java8964
> *Cc:* Wang, Ningjun (LNG-NPV); user@spark.apache.org
> *Subject:* Re: sc.textFile() on windows cannot access UNC path
>
>
>
> You can create your own Input Reader (using java.nio.*) and pass it to the
> sc.newAPIHadoopFile while reading.
>
>
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Tue, Mar 10, 2015 at 6:28 PM, java8964  wrote:
>
> I think the work around is clear.
>
>
>
> Using JDK 7, and implement your own saveAsRemoteWinText() using
> java.nio.path.
>
>
>
> Yong
>  --
>
> From: ningjun.w...@lexisnexis.com
> To: java8...@hotmail.com; user@spark.apache.org
> Subject: RE: sc.textFile() on windows cannot access UNC path
> Date: Tue, 10 Mar 2015 03:02:37 +
>
>
>
> Hi Yong
>
>
>
> Thanks for the reply. Yes it works with local drive letter. But I really
> need to use UNC path because the path is input from at runtime. I cannot
> dynamically assign a drive letter to arbitrary UNC path at runtime.
>
>
>
> Is there any work around that I can use UNC path for sc.textFile(…)?
>
>
>
>
>
> Ningjun
>
>
>
>
>
> *From:* java8964 [mailto:java8...@hotmail.com]
> *Sent:* Monday, March 09, 2015 5:33 PM
> *To:* Wang, Ningjun (LNG-NPV); user@spark.apache.org
> *Subject:* RE: sc.textFile() on windows cannot access UNC path
>
>
>
> This is a Java problem, not really Spark.
>
>
>
> From this page:
> http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u
>
>
>
> You can see that using Java.nio.* on JDK 7, it will fix this issue. But
> Path class in Hadoop will use java.io.*, instead of java.nio.
>
>
>
> You need to manually mount your windows remote share a local driver, like
> "Z:", then it should work.
>
>
>
> Yong
>  --
>
> From: ningjun.w...@lexisnexis.com
> To: user@spark.apache.org
> Subject: sc.textFile() on windows cannot access UNC path
> Date: Mon, 9 Mar 2015 21:09:38 +
>
> I am running Spark on windows 2008 R2. I use sc.textFile() to load text
> file  using UNC path, it does not work.
>
>
>
> sc.textFile(raw"file:10.196.119.230/folder1/abc.txt", 4).count()
>
>
>
> Input path does not exist: file:/10.196.119.230/folder1/abc.txt
>
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/10.196.119.230/tar/Enron/enron-207-short.load
>
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
>
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>
> at
> org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
>
> at
> org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>
> at
> org.apache.spark.rdd.RDD$$

Re: "Timed out while stopping the job generator" plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi,

I discovered what caused my issue when running on YARN and was able to work
around it.

On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer  wrote:

> The processing itself is complete, i.e., the batch currently processed at
> the time of stop() is finished and no further batches are processed.
> However, something keeps the streaming context from stopping properly. In
> local[n] mode, this is not actually a problem (other than I have to wait 20
> seconds for shutdown), but in yarn-cluster mode, I get an error
>
>   akka.actor.InvalidActorNameException: actor name [JobGenerator] is not
> unique!
>

It seems that not all checkpointed RDDs are cleaned (metadata cleared,
checkpoint directories deleted etc.?) at the time when the streamingContext
is stopped, but only afterwards. In particular, when I add
`Thread.sleep(5000)` after my streamingContext.stop() call, then it works
and I can start a different streamingContext afterwards.
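In code, the workaround looks roughly like this (a sketch of what is described above,
not a recommended pattern; the batch interval is arbitrary):

import org.apache.spark.streaming.{Seconds, StreamingContext}

ssc.stop(stopSparkContext = false, stopGracefully = true)
Thread.sleep(5000)  // crude: give the checkpoint/RDD cleanup time to finish
val newSsc = new StreamingContext(sc, Seconds(10))  // now safe to start a new context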

This is pretty ugly, so does anyone know a method to poll whether it's safe
to continue or whether there are still some RDDs waiting to be cleaned up?

Thanks
Tobias


Re: SVD transform of large matrix with MLlib

2015-03-11 Thread Reza Zadeh
Answers:
databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html
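For reference, the MLlib entry point discussed there is (as far as I can tell)
RowMatrix.computeSVD; a minimal sketch with toy sparse rows, where the real input
parsing is left out:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// toy sparse rows; a real job would build these from the input data
val rows = sc.parallelize(Seq(
  Vectors.sparse(5, Seq((0, 1.0), (3, 2.0))),
  Vectors.sparse(5, Seq((1, 3.0), (4, 4.0)))))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(2, computeU = true)   // k = 2 singular values
println(svd.s)                                 // the singular values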
Reza

On Wed, Mar 11, 2015 at 2:33 PM, sergunok  wrote:

> Has somebody used SVD from MLlib for a very large (like 10^6 x 10^7) sparse
> matrix?
> What time did it take?
> What implementation of SVD is used in MLLib?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SVD-transform-of-large-matrix-with-MLlib-tp22005.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Those results look very good for the larger workloads (100MB and 1GB). Were
you also able to run experiments for smaller amounts of data? For instance
broadcasting a single variable to the entire cluster? In the paper you
state that HDFS-based mechanisms performed well only for small amounts of
data. Do you have an approximation for the trade-off point when HDFS-based
becomes more favorable, and BitTorrent-like performs worse? I also read
that the minimum size transmitted using a broadcast variable is 4MB. Maybe
I should look for a different way of sharing this constant?

Use case: I am looking for the most efficient way to perform a
transformation involving a constant (of which the value is determined at
runtime) for a large input file.

Scala example:
var constant1 = sc.broadcast(2) // The actual value, 2 in this case, would
be a result from a different function, generated during runtime
val result = input.map(x => x + constant1.value)

On 11 March 2015 at 21:13, Mosharaf Chowdhury 
wrote:

> The current broadcast algorithm in Spark approximates the one described
> in Section 5 of this paper.
> It is expected to scale sub-linearly; i.e., O(log N), where N is the
> number of machines in your cluster.
> We evaluated up to 100 machines, and it does follow O(log N) scaling.
>
> --
> Mosharaf Chowdhury
> http://www.mosharaf.com/
>
> On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen 
> wrote:
>
>> Thanks Mosharaf, for the quick response! Can you maybe give me some
>> pointers to an explanation of this strategy? Or elaborate a bit more on it?
>> Which parts are involved in which way? Where are the time penalties and how
>> scalable is this implementation?
>>
>> Thanks again,
>>
>> Tom
>>
>> On 11 March 2015 at 16:01, Mosharaf Chowdhury 
>> wrote:
>>
>>> Hi Tom,
>>>
>>> That's an outdated document from 4/5 years ago.
>>>
>>> Spark currently uses a BitTorrent like mechanism that's been tuned for
>>> datacenter environments.
>>>
>>> Mosharaf
>>> --
>>> From: Tom 
>>> Sent: 3/11/2015 4:58 PM
>>> To: user@spark.apache.org
>>> Subject: Which strategy is used for broadcast variables?
>>>
>>> In "Performance and Scalability of Broadcast in Spark" by Mosharaf
>>> Chowdhury
>>> I read that Spark uses HDFS for its broadcast variables. This seems
>>> highly
>>> inefficient. In the same paper alternatives are proposed, among which
>>> "Bittorent Broadcast (BTB)". While studying "Learning Spark," page 105,
>>> second paragraph about Broadcast Variables, I read " The value is sent to
>>> each node only once, using an efficient, BitTorrent-like communication
>>> mechanism."
>>>
>>> - Is the book talking about the proposed BTB from the paper?
>>>
>>> - Is this currently the default?
>>>
>>> - If not, what is?
>>>
>>> Thanks,
>>>
>>> Tom
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Which-strategy-is-used-for-broadcast-variables-tp22004.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
The current broadcast algorithm in Spark approximates the one described in
Section 5 of this paper.
It is expected to scale sub-linearly; i.e., O(log N), where N is the number
of machines in your cluster.
We evaluated up to 100 machines, and it does follow O(log N) scaling.
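For reference, the implementation in use can be seen (and, if necessary, overridden)
via the spark.broadcast.factory setting; in recent releases the torrent-based factory
is the default (exact version behavior is an assumption to check against your release):

val conf = new org.apache.spark.SparkConf()
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")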

--
Mosharaf Chowdhury
http://www.mosharaf.com/

On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen 
wrote:

> Thanks Mosharaf, for the quick response! Can you maybe give me some
> pointers to an explanation of this strategy? Or elaborate a bit more on it?
> Which parts are involved in which way? Where are the time penalties and how
> scalable is this implementation?
>
> Thanks again,
>
> Tom
>
> On 11 March 2015 at 16:01, Mosharaf Chowdhury 
> wrote:
>
>> Hi Tom,
>>
>> That's an outdated document from 4/5 years ago.
>>
>> Spark currently uses a BitTorrent like mechanism that's been tuned for
>> datacenter environments.
>>
>> Mosharaf
>> --
>> From: Tom 
>> Sent: 3/11/2015 4:58 PM
>> To: user@spark.apache.org
>> Subject: Which strategy is used for broadcast variables?
>>
>> In "Performance and Scalability of Broadcast in Spark" by Mosharaf
>> Chowdhury
>> I read that Spark uses HDFS for its broadcast variables. This seems highly
>> inefficient. In the same paper alternatives are proposed, among which
>> "Bittorent Broadcast (BTB)". While studying "Learning Spark," page 105,
>> second paragraph about Broadcast Variables, I read " The value is sent to
>> each node only once, using an efficient, BitTorrent-like communication
>> mechanism."
>>
>> - Is the book talking about the proposed BTB from the paper?
>>
>> - Is this currently the default?
>>
>> - If not, what is?
>>
>> Thanks,
>>
>> Tom
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Which-strategy-is-used-for-broadcast-variables-tp22004.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Top, takeOrdered, sortByKey

2015-03-11 Thread Saba Sehrish
Let me clarify - taking 1 elements of 5 elements using top or 
takeOrdered is taking about 25-50s, which seems slow. I also tried using
sortByKey to sort the elements to get a time estimate, and I get numbers in the
same range.
I'm running this application on a cluster with 5 nodes and using 3 cores each.

I'm interested in knowing what tuning can be done to get better performance for 
top/takeOrdered  than 25-50s. I want to eventually scale it up to use 77 
million elements, but right now I want to see how the performance of either top
or takeOrdered could be improved for the smaller sample I'm using.

On Mar 11, 2015, at 7:24 PM, Imran Rashid  wrote:

I am not entirely sure I understand your question -- are you saying:

* scoring a sample of 50k events is fast
* taking the top N scores of 77M events is slow, no matter what N is

?

if so, this shouldn't come as a huge surprise.  You can't find the top scoring 
elements (no matter how small N is) unless you score all 77M of them.  Very 
naively, you would expect scoring 77M events to take ~1000 times as long as 
scoring 50k events, right?  The fact that it doesn't take that much longer is 
probably b/c of the overhead of just launching the jobs.



On Mon, Mar 9, 2015 at 4:21 PM, Saba Sehrish  wrote:


From: Saba Sehrish 
Date: March 9, 2015 at 4:11:07 PM CDT
To: 
Subject: Using top, takeOrdered, sortByKey

I am using spark for a template matching problem. We have 77 million events in 
the template library, and we compare energy of each of the input event with the 
each of the template event and return a score. In the end we return best 1 
matches with lowest score. A score of 0 is a perfect match.

I down sampled the problem to use only 50k events. For a single event matching 
across all the events in the template (50k) I see 150-200ms for score 
calculation on 25 cores (using YARN cluster), but after that when I perform 
either a top or takeOrdered or even sortByKey the time reaches to 25-50s.
So far I am not able to figure out why such a huge gap going from a list of 
scores to a list of top 1000 scores and why sorting or getting best X matches 
is being dominant by a large factor. One thing I have noticed is that it 
doesn’t matter how many elements I return the time range is the same 25-50s for 
10 - 1 elements.

Any suggestions? if I am not using API properly?

scores is JavaPairRDD, and I do something like
numbestmatches is 10, 100, 1 or any number.

List > bestscores_list = 
scores.takeOrdered(numbestmatches, new TupleComparator());
Or
List > bestscores_list = scores.top(numbestmatches, new 
TupleComparator());
Or
List > bestscores_list = scores.sortByKey();



Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Tathagata Das
Can you show us the code that you are using?

This might help. This is the updated streaming programming guide for 1.3,
soon to be up; here is a quick preview:
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations

TD

On Wed, Mar 11, 2015 at 12:20 PM, Marius Soutier  wrote:

> Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile(“”)).
>
> > On 11.03.2015, at 18:35, Marius Soutier  wrote:
> >
> > Hi,
> >
> > I’ve written a Spark Streaming Job that inserts into a Parquet, using
> stream.foreachRDD(_insertInto(“table”, overwrite = true). Now I’ve added
> checkpointing; everything works fine when starting from scratch. When
> starting from a checkpoint however, the job doesn’t work and produces the
> following exception in the foreachRDD:
> >
> > ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running
> job streaming job 142609383 ms.2
> > org.apache.spark.SparkException: RDD transformations and actions can
> only be invoked by the driver, not inside of other transformations; for
> example, rdd1.map(x => rdd2.values.count() * x) is invalid because the
> values transformation and count action cannot be performed inside of the
> rdd1.map transformation. For more information, see SPARK-5063.
> >   at org.apache.spark.rdd.RDD.sc(RDD.scala:90)
> >   at org.apache.spark.rdd.RDD.(RDD.scala:143)
> >   at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
> >   at
> org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:114)
> >   at
> MyStreamingJob$$anonfun$createComputation$3.apply(MyStreamingJob:167)
> >   at
> MyStreamingJob$$anonfun$createComputation$3.apply(MyStreamingJob:167)
> >   at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:535)
> >   at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:535)
> >   at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
> >   at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> >   at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> >
> >
> >
> >
> > Cheers
> > - Marius
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


StreamingListener

2015-03-11 Thread Corey Nolet
Given the following scenario:

dstream.map(...).filter(...).window(...).foreachrdd()

When would the onBatchCompleted fire?


Re: can spark take advantage of ordered data?

2015-03-11 Thread Imran Rashid
Hi Jonathan,

You might be interested in https://issues.apache.org/jira/browse/SPARK-3655
(not yet available) and https://github.com/tresata/spark-sorted (not part
of Spark, but it is available right now). Hopefully that's what you are
looking for. To the best of my knowledge, that covers what is available now
/ what is being worked on.

Imran

On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney 
wrote:

> Hello all,
>
> I am wondering if spark already has support for optimizations on sorted
> data and/or if such support could be added (I am comfortable dropping to a
> lower level if necessary to implement this, but I'm not sure if it is
> possible at all).
>
> Context: we have a number of data sets which are essentially already
> sorted on a key. With our current systems, we can take advantage of this to
> do a lot of analysis in a very efficient fashion...merges and joins, for
> example, can be done very efficiently, as can folds on a secondary key and
> so on.
>
> I was wondering if spark would be a fit for implementing these sorts of
> optimizations? Obviously it is sort of a niche case, but would this be
> achievable? Any pointers on where I should look?
>


Re: Running Spark from Scala source files other than main file

2015-03-11 Thread Imran Rashid
Did you forget to specify the main class with "--class Main"? Though if that
were it, you should at least see *some* error message, so I'm confused
myself ...

On Wed, Mar 11, 2015 at 6:53 AM, Aung Kyaw Htet  wrote:

> Hi Everyone,
>
> I am developing a Scala app in which the main object does not call the
> SparkContext, but another object defined in the same package creates it,
> runs Spark operations, and closes it. The jar file is built successfully in
> Maven, but when I call spark-submit with this jar, Spark does not
> seem to execute any code.
>
> So my code looks like
>
> [Main.scala]
>
> object Main {
>   def main(args: Array[String]) {
> /* check parameters */
>  Component1.start(parameters)
> }
>   }
>
> [Component1.scala]
>
> object Component1{
>   def start{
>val sc = new SparkContext(conf)
>/* do spark operations */
>sc.close()
>   }
> }
>
> The above code compiles into Main.jar but spark-submit does not execute
> anything and does not show me any error (not in the logs either.)
>
> spark-submit master= spark:// Main.jar
>
> I've got this all the code working before when I wrote a single scala
> file, but now that I am separating into multiple scala source files,
> something isn't running right.
>
> Any advice on this would be greatly appreciated!
>
> Regards,
> Aung
>


Re: How to use more executors

2015-03-11 Thread Nan Zhu
I think this should go into another PR.

Can you create a JIRA for that?

Best,  

--  
Nan Zhu
http://codingcat.me


On Wednesday, March 11, 2015 at 8:50 PM, Du Li wrote:

> Is it possible to extend this PR further (or create another PR) to allow for 
> per-node configuration of workers?  
>  
> There are many discussions about heterogeneous spark cluster. Currently 
> configuration on master will override those on the workers. Many spark users 
> have the need for having machines with different cpu/memory capacities in the 
> same cluster.
>  
> Du  
>  
>  
> On Wednesday, January 21, 2015 3:59 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
>  
>  
> …not sure when will it be reviewed…
>  
> but for now you can work around by allowing multiple worker instances on a 
> single machine  
>  
> http://spark.apache.org/docs/latest/spark-standalone.html
>  
> search SPARK_WORKER_INSTANCES
>  
> Best,  
>  
> --  
> Nan Zhu
> http://codingcat.me
>  
> On Wednesday, January 21, 2015 at 6:50 PM, Larry Liu wrote:
> > Will  SPARK-1706 be included in next release?
> >  
> > On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu  > (mailto:yuzhih...@gmail.com)> wrote:
> > > Please see SPARK-1706
> > >  
> > > On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu  > > (mailto:larryli...@gmail.com)> wrote:
> > > > I tried to submit a job with  --conf "spark.cores.max=6"  or 
> > > > --total-executor-cores 6 on a standalone cluster. But I don't see more 
> > > > than 1 executor on each worker. I am wondering how to use multiple 
> > > > executors when submitting jobs.
> > > >  
> > > > Thanks
> > > > larry
> > > >  
> > > >  
> > >  
> > >  
> > >  
> >  
>  
>  
>  



Re: Numbering RDD members Sequentially

2015-03-11 Thread Mark Hamstra
>
> not quite sure why it is called zipWithIndex since zipping is not involved
>

It isn't?
http://stackoverflow.com/questions/1115563/what-is-zip-functional-programming
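A small illustration of the naming: conceptually the RDD is zipped with the index
sequence 0, 1, 2, ...:

val rdd = sc.parallelize(Seq("a", "b", "c"), 2)
rdd.zipWithIndex().collect()
// Array((a,0), (b,1), (c,2)) -- each element paired ("zipped") with its index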

On Wed, Mar 11, 2015 at 5:18 PM, Steve Lewis  wrote:

>
> -- Forwarded message --
> From: Steve Lewis 
> Date: Wed, Mar 11, 2015 at 9:13 AM
> Subject: Re: Numbering RDD members Sequentially
> To: "Daniel, Ronald (ELS-SDG)" 
>
>
> perfect - exactly what I was looking for, not quite sure why it is called 
> zipWithIndex
> since zipping is not involved
> my code does something like this where IMeasuredSpectrum is a large class
> we want to set an index for
>
> public static JavaRDD 
> indexSpectra(JavaRDD pSpectraToScore) {
>
> JavaPairRDD indexed = 
> pSpectraToScore.zipWithIndex();
>
> pSpectraToScore = indexed.map(new AddIndexToSpectrum()) ;
> return pSpectraToScore;
> }
>
> public class AddIndexToSpectrum implements  
> Function, IMeasuredSpectrum> {
> @Override
> public IMeasuredSpectrum doCall(final Tuple2 java.lang.Long> v1) throws Exception {
> IMeasuredSpectrum spec = v1._1();
> long index = v1._2();
> spec.setIndex(   index + 1 );
>  return spec;
> }
>
>}
>
>  }
>
>
> On Wed, Mar 11, 2015 at 6:57 AM, Daniel, Ronald (ELS-SDG) <
> r.dan...@elsevier.com> wrote:
>
>>  Have you looked at zipWithIndex?
>>
>>
>>
>> *From:* Steve Lewis [mailto:lordjoe2...@gmail.com]
>> *Sent:* Tuesday, March 10, 2015 5:31 PM
>> *To:* user@spark.apache.org
>> *Subject:* Numbering RDD members Sequentially
>>
>>
>>
>> I have Hadoop Input Format which reads records and produces
>>
>>
>>
>> JavaPairRDD locatedData  where
>>
>> _1() is a formatted version of the file location - like
>>
>> "12690",, "24386 ."27523 ...
>>
>> _2() is data to be processed
>>
>>
>>
>> For historical reasons  I want to convert _1() into in integer
>> representing the record number.
>>
>> so keys become "0001", "002" ...
>>
>>
>>
>> (Yes I know this cannot be done in parallel) The PairRDD may be too large
>> to collect and work on one machine but small enough to handle on a single
>> machine.
>>  I could use toLocalIterator to guarantee execution on one machine but
>> last time I tried this all kinds of jobs were launched to get the next
>> element of the iterator and I was not convinced this approach was efficient.
>>
>>
>>
>
>


Re: How to use more executors

2015-03-11 Thread Du Li
Is it possible to extend this PR further (or create another PR) to allow for 
per-node configuration of workers? 
There are many discussions about heterogeneous Spark clusters. Currently the
configuration on the master will override those on the workers. Many Spark users
have the need for machines with different CPU/memory capacities in the
same cluster.
Du 

 On Wednesday, January 21, 2015 3:59 PM, Nan Zhu  
wrote:
   

 …not sure when will it be reviewed…
but for now you can work around by allowing multiple worker instances on a 
single machine 
http://spark.apache.org/docs/latest/spark-standalone.html
search SPARK_WORKER_INSTANCES
Best, 
--
Nan Zhu
http://codingcat.me

On Wednesday, January 21, 2015 at 6:50 PM, Larry Liu wrote:
 Will  SPARK-1706 be included in next release?
On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu  wrote:

Please see SPARK-1706
On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu  wrote:

I tried to submit a job with  --conf "spark.cores.max=6"  or 
--total-executor-cores 6 on a standalone cluster. But I don't see more than 1 
executor on each worker. I am wondering how to use multiple executors when 
submitting jobs.
Thanks
larry



 
  
 

   

Re: "Timed out while stopping the job generator" plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Sean,

On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer  wrote:
>
> it seems like I am unable to shut down my StreamingContext properly, both
> in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster
> mode, subsequent use of a new StreamingContext will raise
> an InvalidActorNameException.
>

I was wondering if this is related to your question on spark-dev
  http://tinyurl.com/q5cd5px
Did you get any additional insight into this issue?

In my case the processing of the first batch completes, but I don't know if
there is anything wrong with the checkpoints. When I look at the
corresponding checkpoint directory in HDFS, it doesn't seem like all state
RDDs are persisted there, just a subset. Any ideas?

Thanks
Tobias


RE: Spark SQL using Hive metastore

2015-03-11 Thread Cheng, Hao
Hi, Robert

Spark SQL currently only supports Hive 0.12.0 (needs a re-compile of the package)
and 0.13.1 (by default); I am not so sure if it supports the Hive 0.14 metastore
service as a backend. Another thing you can try is configuring
$SPARK_HOME/conf/hive-site.xml to access the remote metastore database
directly ("javax.jdo.option.ConnectionURL" and
"javax.jdo.option.ConnectionDriverName" are required); then you can start
Spark SQL like:
bin/spark-sql --jars lib/mysql-connector-xxx.jar
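A rough sketch of the relevant hive-site.xml entries for a MySQL-backed metastore
(the JDBC URL and driver class below are placeholders/assumptions):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>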

For the "SnappyError", it seems you didn't configure the Snappy native library
properly for Spark; can you check the configuration file
$SPARK_HOME/conf/spark-xxx.conf?

Cheng Hao

From: Grandl Robert [mailto:rgra...@yahoo.com.INVALID]
Sent: Thursday, March 12, 2015 5:07 AM
To: user@spark.apache.org
Subject: Spark SQL using Hive metastore

Hi guys,

I am a newbie in running Spark SQL / Spark. My goal is to run some TPC-H 
queries atop Spark SQL using Hive metastore.

It looks like spark 1.2.1 release has Spark SQL / Hive support. However, I am 
not able to fully connect all the dots.

I did the following:
1. Copied hive-site.xml from hive to spark/conf
2. Copied mysql connector to spark/lib
3. I have started hive metastore service: hive --service metastore
3. I have started ./bin/spark-sql
4. I typed: spark-sql> show tables;
However, the following error was thrown:
 Job 0 failed: collect at SparkPlan.scala:84, took 0.241788 s
15/03/11 15:02:35 ERROR SparkSQLDriver: Failed in [show tables]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 
serialization failed: org.xerial.snappy.SnappyError: 
[FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux 
and os.arch=aarch64

Do  you know what I am doing wrong ? I mention that I have hive-0.14 instead of 
hive-0.13.

And another question: What is the right command to run sql queries with spark 
sql using hive metastore ?

Thanks,
Robert



Re: How to use more executors

2015-03-11 Thread Nan Zhu
At least 1.4, I think.

For now, using YARN or allowing multiple worker instances works just fine.
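For the multiple-worker-instances workaround, the settings go into conf/spark-env.sh
on each worker machine; a hedged example, assuming a 24-core / 96 GB box that you
want to split into 3 workers:

SPARK_WORKER_INSTANCES=3
SPARK_WORKER_CORES=8     # cores per worker instance
SPARK_WORKER_MEMORY=30g  # memory per worker instance (leave headroom for the OS)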

Best,  

--  
Nan Zhu
http://codingcat.me


On Wednesday, March 11, 2015 at 8:42 PM, Du Li wrote:

> Is it being merged in the next release? It's indeed a critical patch!
>  
> Du  
>  
>  
> On Wednesday, January 21, 2015 3:59 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
>  
>  
> …not sure when will it be reviewed…
>  
> but for now you can work around by allowing multiple worker instances on a 
> single machine  
>  
> http://spark.apache.org/docs/latest/spark-standalone.html
>  
> search SPARK_WORKER_INSTANCES
>  
> Best,  
>  
> --  
> Nan Zhu
> http://codingcat.me
>  
> On Wednesday, January 21, 2015 at 6:50 PM, Larry Liu wrote:
> > Will  SPARK-1706 be included in next release?
> >  
> > On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu  > (mailto:yuzhih...@gmail.com)> wrote:
> > > Please see SPARK-1706
> > >  
> > > On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu  > > (mailto:larryli...@gmail.com)> wrote:
> > > > I tried to submit a job with  --conf "spark.cores.max=6"  or 
> > > > --total-executor-cores 6 on a standalone cluster. But I don't see more 
> > > > than 1 executor on each worker. I am wondering how to use multiple 
> > > > executors when submitting jobs.
> > > >  
> > > > Thanks
> > > > larry
> > > >  
> > > >  
> > >  
> > >  
> > >  
> >  
>  
>  
>  



Re: saveAsTextFile extremely slow near finish

2015-03-11 Thread Imran Rashid
Is your data skewed?  Could it be that there are a few keys with a huge
number of records?  You might consider outputting
(recordA, count)
(recordB, count)

instead of

recordA
recordA
recordA
...


you could do this with:

input = sc.textFile
pairsCounts = input.map{x => (x,1)}.reduceByKey{_ + _}
sorted = pairsCounts.sortByKey
sorted.saveAsTextFile


On Mon, Mar 9, 2015 at 12:31 PM, mingweili0x  wrote:

> I'm basically running a sorting using spark. The spark program will read
> from
> HDFS, sort on composite keys, and then save the partitioned result back to
> HDFS.
> pseudo code is like this:
>
> input = sc.textFile
> pairs = input.mapToPair
> sorted = pairs.sortByKey
> values = sorted.values
> values.saveAsTextFile
>
>  Input size is ~ 160G, and I made 1000 partitions specified in
> JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is
> splitted into two stages: saveAsTextFile and mapToPair. MapToPair finished
> in 8 mins. While saveAsTextFile took ~15mins to reach (2366/2373) progress
> and the last few jobs just took forever and never finishes.
>
> Cluster setup:
> 8 nodes
> on each node: 15gb memory, 8 cores
>
> running parameters:
> --executor-memory 12G
> --conf "spark.cores.max=60"
>
> Thank you for any help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to use more executors

2015-03-11 Thread Du Li
Is it being merged in the next release? It's indeed a critical patch!
Du 

 On Wednesday, January 21, 2015 3:59 PM, Nan Zhu  
wrote:
   

 …not sure when will it be reviewed…
but for now you can work around by allowing multiple worker instances on a 
single machine 
http://spark.apache.org/docs/latest/spark-standalone.html
search SPARK_WORKER_INSTANCES
Best, 
--
Nan Zhu
http://codingcat.me

On Wednesday, January 21, 2015 at 6:50 PM, Larry Liu wrote:
 Will  SPARK-1706 be included in next release?
On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu  wrote:

Please see SPARK-1706
On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu  wrote:

I tried to submit a job with  --conf "spark.cores.max=6"  or 
--total-executor-cores 6 on a standalone cluster. But I don't see more than 1 
executor on each worker. I am wondering how to use multiple executors when 
submitting jobs.
Thanks
larry



 
  
 

   

RE: can spark take advantage of ordered data?

2015-03-11 Thread java8964
RangePartitioner?
At least for join, you can implement your own partitioner to utilize the
sorted data.
Just my 2 cents.
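A hedged Scala sketch of that idea, with toy placeholder RDDs: if both sides are
partitioned with the same partitioner ahead of time, the subsequent join reuses that
partitioning instead of doing a fresh full shuffle of both sides.

import org.apache.spark.RangePartitioner
import org.apache.spark.SparkContext._   // pair-RDD functions (needed pre-1.3)

val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
val right = sc.parallelize(Seq(1 -> "x", 3 -> "y"))

val part   = new RangePartitioner(4, left)   // partition count is arbitrary here
val leftP  = left.partitionBy(part).persist()
val rightP = right.partitionBy(part).persist()
val joined = leftP.join(rightP)              // co-partitioned join, no extra shuffle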
Date: Wed, 11 Mar 2015 17:38:04 -0400
Subject: can spark take advantage of ordered data?
From: jcove...@gmail.com
To: User@spark.apache.org

Hello all,
I am wondering if spark already has support for optimizations on sorted data 
and/or if such support could be added (I am comfortable dropping to a lower 
level if necessary to implement this, but I'm not sure if it is possible at 
all).
Context: we have a number of data sets which are essentially already sorted on 
a key. With our current systems, we can take advantage of this to do a lot of 
analysis in a very efficient fashion...merges and joins, for example, can be 
done very efficiently, as can folds on a secondary key and so on.
I was wondering if spark would be a fit for implementing these sorts of 
optimizations? Obviously it is sort of a niche case, but would this be 
achievable? Any pointers on where I should look?
   

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-11 Thread Imran Rashid
I'm not sure what you mean.   Are you asking how you can recompile all of
spark and deploy it, instead of using one of the pre-built versions?

https://spark.apache.org/docs/latest/building-spark.html

Or are you seeing compile problems specifically with
HighlyCompressedMapStatus? The code compiles fine, so I'm not sure what
problem you are running into -- we'd need a lot more info to help.

On Tue, Mar 10, 2015 at 6:54 PM, Arun Luthra  wrote:

> Does anyone know how to get the HighlyCompressedMapStatus to compile?
>
> I will try turning off kryo in 1.2.0 and hope things don't break.  I want
> to benefit from the MapOutputTracker fix in 1.2.0.
>
> On Tue, Mar 3, 2015 at 5:41 AM, Imran Rashid  wrote:
>
>> the scala syntax for arrays is Array[T], not T[], so you want to use
>> something:
>>
>> kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]])
>> kryo.register(classOf[Array[Short]])
>>
>> nonetheless, the spark should take care of this itself.  I'll look into
>> it later today.
>>
>>
>> On Mon, Mar 2, 2015 at 2:55 PM, Arun Luthra 
>> wrote:
>>
>>> I think this is a Java vs scala syntax issue. Will check.
>>>
>>> On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra 
>>> wrote:
>>>
 Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949

 I tried this as a workaround:

 import org.apache.spark.scheduler._
 import org.roaringbitmap._

 ...


 kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
 kryo.register(classOf[org.roaringbitmap.RoaringArray])
 kryo.register(classOf[org.roaringbitmap.ArrayContainer])

 kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus])
 kryo.register(classOf[org.roaringbitmap.RoaringArray$Element])
 kryo.register(classOf[org.roaringbitmap.RoaringArray$Element[]])
 kryo.register(classOf[short[]])


 in build file:

 libraryDependencies += "org.roaringbitmap" % "RoaringBitmap" % "0.4.8"


 This fails to compile:

 ...:53: identifier expected but ']' found.

 [error]
 kryo.register(classOf[org.roaringbitmap.RoaringArray$Element[]])

 also:

 :54: identifier expected but ']' found.

 [error] kryo.register(classOf[short[]])
 also:

 :51: class HighlyCompressedMapStatus in package scheduler cannot be
 accessed in package org.apache.spark.scheduler
 [error]
 kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus])


 Suggestions?

 Arun

>>>
>>>
>>
>


Re: Process time series RDD after sortByKey

2015-03-11 Thread Imran Rashid
This is a very interesting use case. First of all, it's worth pointing out
that if you really need to process the data sequentially, fundamentally you
are limiting the parallelism you can get. E.g., if you need to process the
entire data set sequentially, then you can't get any parallelism. If you
can process each hour separately, but need to process data within an hour
sequentially, then the max parallelism you can get for one day is 24.

But let's say you're OK with that. Zhan Zhang's solution is good if you just
want to process the entire dataset sequentially. But what if you wanted to
process each hour separately, so you can at least create 24 tasks that can
be run in parallel for one day? I think you would need to create your own
subclass of RDD that is similar in spirit to what CoalescedRDD does. Your
RDD would have 24 partitions, and each partition would depend on some set
of partitions in its parent (your sorted RDD with 1000 partitions). I
don't think you could use CoalescedRDD directly because you want more control
over the way the partitions get grouped together.
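A very rough, untested sketch of that idea (the Spark-internal classes used here are
developer-level APIs, so treat every signature as an assumption to check against your
Spark version): an RDD whose partitions each wrap a contiguous group of the sorted
parent's partitions.

import scala.reflect.ClassTag
import org.apache.spark.{Dependency, NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

case class GroupedPartition(index: Int, parentIds: Seq[Int]) extends Partition

// `groups` maps each output partition (e.g. one per hour) to the parent partition ids it covers
class GroupedPartitionsRDD[T: ClassTag](parent: RDD[T], groups: Seq[Seq[Int]])
    extends RDD[T](parent.context, Nil) {

  override protected def getPartitions: Array[Partition] =
    groups.zipWithIndex.map { case (ids, i) => GroupedPartition(i, ids) }.toArray

  override protected def getDependencies: Seq[Dependency[_]] = Seq(
    new NarrowDependency(parent) {
      override def getParents(partitionId: Int): Seq[Int] = groups(partitionId)
    })

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val gp = split.asInstanceOf[GroupedPartition]
    // read the grouped parent partitions one after another, preserving their order
    gp.parentIds.iterator.flatMap(i => parent.iterator(parent.partitions(i), context))
  }
}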

This answer is very similar to my answer to your other question about
controlling partitions; hope it helps! :)


On Mon, Mar 9, 2015 at 5:41 PM, Shuai Zheng  wrote:

> Hi All,
>
>
>
> I am processing some time series data. For one day, it might has 500GB,
> then for each hour, it is around 20GB data.
>
>
>
> I need to sort the data before I start process. Assume I can sort them
> successfully
>
>
>
> *dayRDD.sortByKey*
>
>
>
> but after that, I might have thousands of partitions (to make the sort
> successfully), might be 1000 partitions. And then I try to process the data
> by hour (not need exactly one hour, but some kind of similar time frame).
> And I can’t just re-partition size to 24 because then one partition might
> be too big to fit into memory (if it is 20GB). So is there any way for me
> to just can process underlying partitions by certain order? Basically I
> want to call mapPartitionsWithIndex with a range of index?
>
>
>
> Any way to do it? Hope I described my issue clearly… :)
>
>
>
> Regards,
>
>
>
> Shuai
>
>
>
>
>


RE: Spark SQL using Hive metastore

2015-03-11 Thread java8964
You need to include the Hadoop native library in your spark-shell/spark-sql,
assuming your Hadoop native library includes the native Snappy library.

spark-sql --driver-library-path point_to_your_hadoop_native_library

In spark-sql, you can just use any command as you would in the Hive CLI.
Yong

Date: Wed, 11 Mar 2015 21:06:54 +
From: rgra...@yahoo.com.INVALID
To: user@spark.apache.org
Subject: Spark SQL using Hive metastore

Hi guys,
I am a newbie in running Spark SQL / Spark. My goal is to run some TPC-H 
queries atop Spark SQL using Hive metastore. 
It looks like spark 1.2.1 release has Spark SQL / Hive support. However, I am 
not able to fully connect all the dots. 

I did the following:
1. Copied hive-site.xml from hive to spark/conf
2. Copied mysql connector to spark/lib
3. I have started hive metastore service: hive --service metastore
3. I have started ./bin/spark-sql
4. I typed: spark-sql> show tables; However, the following error was thrown:
Job 0 failed: collect at SparkPlan.scala:84, took 0.241788 s
15/03/11 15:02:35 ERROR SparkSQLDriver: Failed in [show tables]
org.apache.spark.SparkException: Job aborted due to stage failure: Task
serialization failed: org.xerial.snappy.SnappyError:
[FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux
and os.arch=aarch64
Do  you know what I am doing wrong ? I mention that I have hive-0.14 instead of 
hive-0.13. 

And another question: What is the right command to run sql queries with spark 
sql using hive metastore ?
Thanks,
Robert
  

Re: Top, takeOrdered, sortByKey

2015-03-11 Thread Imran Rashid
I am not entirely sure I understand your question -- are you saying:

* scoring a sample of 50k events is fast
* taking the top N scores of 77M events is slow, no matter what N is

?

If so, this shouldn't come as a huge surprise. You can't find the top-scoring
elements (no matter how small N is) unless you score all 77M of
them. Very naively, you would expect scoring 77M events to take ~1000
times as long as scoring 50k events, right? The fact that it doesn't take
that much longer is probably because of the overhead of just launching the jobs.
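As a side note (hedged; `scores` below is a toy stand-in for the real pair RDD keyed
by score): takeOrdered/top only keep a bounded top-N per partition and then merge
those, whereas sortByKey does a full shuffle and sort before anything can be taken.

import org.apache.spark.SparkContext._   // pair-RDD functions (needed pre-1.3)

val scores = sc.parallelize(Seq(0.7 -> 1L, 0.1 -> 2L, 0.4 -> 3L))   // (score, templateId)

val topN        = scores.takeOrdered(1000)(Ordering.by[(Double, Long), Double](_._1))
val topNviaSort = scores.sortByKey().take(1000)   // full sort of everything first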



On Mon, Mar 9, 2015 at 4:21 PM, Saba Sehrish  wrote:

>
>
>  *From:* Saba Sehrish 
> *Date:* March 9, 2015 at 4:11:07 PM CDT
> *To:* 
> *Subject:* *Using top, takeOrdered, sortByKey*
>
>   I am using spark for a template matching problem. We have 77 million
> events in the template library, and we compare energy of each of the input
> event with the each of the template event and return a score. In the end we
> return best 1 matches with lowest score. A score of 0 is a perfect
> match.
>
>  I down sampled the problem to use only 50k events. For a single event
> matching across all the events in the template (50k) I see 150-200ms for
> score calculation on 25 cores (using YARN cluster), but after that when I
> perform either a top or takeOrdered or even sortByKey the time reaches to
> 25-50s.
> So far I am not able to figure out why such a huge gap going from a list
> of scores to a list of top 1000 scores and why sorting or getting best X
> matches is being dominant by a large factor. One thing I have noticed is
> that it doesn’t matter how many elements I return the time range is the
> same 25-50s for 10 - 1 elements.
>
>  Any suggestions? if I am not using API properly?
>
>  scores is JavaPairRDD, and I do something like
> numbestmatches is 10, 100, 1 or any number.
>
>   List > bestscores_list =
> scores.takeOrdered(numbestmatches, new TupleComparator());
>  Or
>  List > bestscores_list =
> scores.top(numbestmatches, new TupleComparator());
>  Or
>  List > bestscores_list = scores.sortByKey();
>
>


Fwd: Numbering RDD members Sequentially

2015-03-11 Thread Steve Lewis
-- Forwarded message --
From: Steve Lewis 
Date: Wed, Mar 11, 2015 at 9:13 AM
Subject: Re: Numbering RDD members Sequentially
To: "Daniel, Ronald (ELS-SDG)" 


Perfect - exactly what I was looking for. Not quite sure why it is
called zipWithIndex,
since zipping is not involved.
my code does something like this where IMeasuredSpectrum is a large class
we want to set an index for

public static JavaRDD
indexSpectra(JavaRDD pSpectraToScore) {

JavaPairRDD indexed =
pSpectraToScore.zipWithIndex();

pSpectraToScore = indexed.map(new AddIndexToSpectrum()) ;
return pSpectraToScore;
}

public class AddIndexToSpectrum implements
Function, IMeasuredSpectrum>
{
@Override
public IMeasuredSpectrum doCall(final Tuple2 v1) throws Exception {
IMeasuredSpectrum spec = v1._1();
long index = v1._2();
spec.setIndex(   index + 1 );
 return spec;
}

   }

 }


On Wed, Mar 11, 2015 at 6:57 AM, Daniel, Ronald (ELS-SDG) <
r.dan...@elsevier.com> wrote:

>  Have you looked at zipWithIndex?
>
>
>
> *From:* Steve Lewis [mailto:lordjoe2...@gmail.com]
> *Sent:* Tuesday, March 10, 2015 5:31 PM
> *To:* user@spark.apache.org
> *Subject:* Numbering RDD members Sequentially
>
>
>
> I have Hadoop Input Format which reads records and produces
>
>
>
> JavaPairRDD locatedData  where
>
> _1() is a formatted version of the file location - like
>
> "12690",, "24386 ."27523 ...
>
> _2() is data to be processed
>
>
>
> For historical reasons  I want to convert _1() into in integer
> representing the record number.
>
> so keys become "0001", "002" ...
>
>
>
> (Yes I know this cannot be done in parallel) The PairRDD may be too large
> to collect and work on one machine but small enough to handle on a single
> machine.
>  I could use toLocalIterator to guarantee execution on one machine but
> last time I tried this all kinds of jobs were launched to get the next
> element of the iterator and I was not convinced this approach was efficient.
>
>
>


JavaSparkContext - jarOfClass or jarOfObject dont work

2015-03-11 Thread Nirav Patel
Hi, I am trying to run my Spark service against a cluster. As it turns out, I
have to call setJars and set my application jar in there. If I do it using a
physical path like the following, it works:
`conf.setJars(new String[]{"/path/to/jar/Sample.jar"});`

but if I try to use the JavaSparkContext (or SparkContext) API jarOfClass or
jarOfObject, it doesn't work. Basically the API can't find the jar itself.

The following returns empty:
`JavaSparkContext.jarOfObject(this);
JavaSparkContext.jarOfClass(this.getClass())`

It would be an excellent API if only it worked! Is anyone else able to make use
of this?



Re: How to preserve/preset partition information when load time series data?

2015-03-11 Thread Imran Rashid
It should be *possible* to do what you want ... but if I understand you
right, there isn't really any very easy way to do it.  I think you would
need to write your own subclass of RDD, which has its own logic on how the
input files get put divided among partitions.  You can probably subclass
HadoopRDD and just modify getPartitions().  your logic could look at the
day of each filename to decide which partition it goes into.  You'd need to
make corresponding changes for HadoopPartition & the compute() method.

(or if you can't subclass HadoopRDD directly you can use it for
inspiration.)

On Mon, Mar 9, 2015 at 11:18 AM, Shuai Zheng  wrote:

> Hi All,
>
>
>
> If I have a set of time series data files, they are in parquet format and
> the data for each day are store in naming convention, but I will not know
> how many files for one day.
>
>
>
> 20150101a.parq
>
> 20150101b.parq
>
> 20150102a.parq
>
> 20150102b.parq
>
> 20150102c.parq
>
> …
>
> 201501010a.parq
>
> …
>
>
>
> Now I try to write a program to process the data. And I want to make sure
> each day’s data into one partition, of course I can load all into one big
> RDD to do partition but it will be very slow. As I already know the time
> series of the file name, is it possible for me to load the data into the
> RDD also preserve the partition? I know I can preserve the partition by
> each file, but is it anyway for me to load the RDD and preserve partition
> based on a set of files: one partition multiple files?
>
>
>
> I think it is possible because when I load a RDD from 100 files (assume
> cross 100 days), I will have 100 partitions (if I disable file split when
> load file). Then I can use a special coalesce to repartition the RDD? But I
> don’t know is it possible to do that in current Spark now?
>
>
>
> Regards,
>
>
>
> Shuai
>


Re: SQL with Spark Streaming

2015-03-11 Thread Tobias Pfeiffer
Hi,

On Thu, Mar 12, 2015 at 12:08 AM, Huang, Jie  wrote:
>
> According to my understanding, your approach is to register a series of
> tables by using transformWith, right? And then, you can get a new Dstream
> (i.e., SchemaDstream), which consists of lots of SchemaRDDs.
>

Yep, it's basically the following:

case class SchemaDStream(sqlc: SQLContext,
 dataStream: DStream[Row],
 schemaStream: DStream[StructType]) {
  def registerStreamAsTable(name: String): Unit = {
foreachRDD(_.registerTempTable(name))
  }

  def foreachRDD(func: SchemaRDD => Unit): Unit = {
def executeFunction(dataRDD: RDD[Row], schemaRDD: RDD[StructType]):
RDD[Unit] = {
  val schema: StructType = schemaRDD.collect.head
  val dataWithSchema: SchemaRDD = sqlc.applySchema(dataRDD, schema)
  val result = func(dataWithSchema)
  schemaRDD.map(x => result) // return RDD[Unit]
}
dataStream.transformWith(schemaStream, executeFunction
_).foreachRDD(_.count())
  }

}

In a similar way you could add a `transform(func: SchemaRDD => SchemaRDD)`
method. But as I said, I am not sure about performance.

Tobias


Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Tobias Pfeiffer
Hi,

On Wed, Mar 11, 2015 at 11:05 PM, Cesar Flores  wrote:
>
> Thanks for both answers. One final question: isn't registerTempTable an
> extra step that the SQL queries require, which may decrease
> performance compared to the language-integrated method calls?
>

As far as I know, registerTempTable is just a Map[String, SchemaRDD]
insertion, nothing that would be measurable. In any case, there are no
distributed/RDD operations involved, I think.

Tobias


Re: Writing to a single file from multiple executors

2015-03-11 Thread Tathagata Das
Why do you have to write a single file?



On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti 
wrote:

> Hi Experts,
>
> I have a scenario, where in I want to write to a avro file from a streaming
> job that reads data from kafka.
>
> But the issue is, as there are multiple executors and when all try to write
> to a given file I get a concurrent exception.
>
> I way to mitigate the issue is to repartition & have a single writer task,
> but as my data is huge that is not a feasible option.
>
> Any suggestions welcomed.
>
> Regards,
> Sam
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Writing-to-a-single-file-from-multiple-executors-tp22003.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Thanks Mosharaf, for the quick response! Can you maybe give me some
pointers to an explanation of this strategy? Or elaborate a bit more on it?
Which parts are involved in which way? Where are the time penalties and how
scalable is this implementation?

Thanks again,

Tom

On 11 March 2015 at 16:01, Mosharaf Chowdhury 
wrote:

> Hi Tom,
>
> That's an outdated document from 4/5 years ago.
>
> Spark currently uses a BitTorrent like mechanism that's been tuned for
> datacenter environments.
>
> Mosharaf
> --
> From: Tom 
> Sent: ‎3/‎11/‎2015 4:58 PM
> To: user@spark.apache.org
> Subject: Which strategy is used for broadcast variables?
>
> In "Performance and Scalability of Broadcast in Spark" by Mosharaf
> Chowdhury
> I read that Spark uses HDFS for its broadcast variables. This seems highly
> inefficient. In the same paper alternatives are proposed, among which
> "Bittorent Broadcast (BTB)". While studying "Learning Spark," page 105,
> second paragraph about Broadcast Variables, I read " The value is sent to
> each node only once, using an efficient, BitTorrent-like communication
> mechanism."
>
> - Is the book talking about the proposed BTB from the paper?
>
> - Is this currently the default?
>
> - If not, what is?
>
> Thanks,
>
> Tom
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Which-strategy-is-used-for-broadcast-variables-tp22004.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Yana Kadiyska
You might also want to see if TaskScheduler helps with that. I have not
used it with Windows 2008 R2 but it generally does allow you to schedule a
bat file to run on startup

On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
>
> Thanks for the suggestion. I will try that.
>
>
>
> Ningjun
>
>
>
>
>
> From: Silvio Fiorito [mailto:silvio.fior...@granturing.com]
> Sent: Wednesday, March 11, 2015 12:40 AM
> To: Wang, Ningjun (LNG-NPV); user@spark.apache.org
> Subject: Re: Is it possible to use windows service to start and stop
spark standalone cluster
>
>
>
> Have you tried Apache Daemon?
http://commons.apache.org/proper/commons-daemon/procrun.html
>
>
>
> From: , "Ningjun (LNG-NPV)"
> Date: Tuesday, March 10, 2015 at 11:47 PM
> To: "user@spark.apache.org"
> Subject: Is it possible to use windows service to start and stop spark
standalone cluster
>
>
>
> We are using spark stand alone cluster on Windows 2008 R2. I can start
spark clusters by open an command prompt and run the following
>
>
>
> bin\spark-class.cmd org.apache.spark.deploy.master.Master
>
>
>
> bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://
mywin.mydomain.com:7077
>
>
>
> I can stop spark cluster by pressing Ctril-C.
>
>
>
> The problem is that if the machine is reboot, I have to manually start
the spark cluster again as above. Is it possible to use a windows service
to start cluster? This way when the machine is reboot, the windows service
will automatically restart spark cluster. How to stop spark cluster using
windows service is also a challenge.
>
>
>
> Please advise.
>
>
>
> Thanks
>
>
>
> Ningjun


PySpark: Python 2.7 cluster installation script (with Numpy, IPython, etc)

2015-03-11 Thread Sebastián Ramírez
Many times, when I'm setting up a cluster, I have to use an operating
system (such as RedHat or CentOS 6.5) which has an old version of Python (Python
2.6). For example, when using a Hadoop distribution that only supports
those operating systems (such as Hortonworks' HDP or Cloudera).

That also makes installing additional advanced Python packages
difficult (such as Numpy, IPython, etc.).

Then I tend to use Anaconda Python, an open source version of Python with
many of those packages pre-built and pre-installed.

But installing Anaconda in each node of the cluster might be tedious.

So I made a *simple script which helps install Anaconda Python on the
machines of a cluster* more easily.

I wanted to share it here, in case it can help someone using PySpark.

https://github.com/tiangolo/anaconda_cluster_install


*Sebastián Ramírez*
Head of Software Development

 

 Tel: (+571) 795 7950 ext: 1012
 Cel: (+57) 300 370 77 10
 Calle 73 No 7 - 06  Piso 4
 Linkedin: co.linkedin.com/in/tiangolo/
 Twitter: @tiangolo 
 Email: sebastian.rami...@senseta.com
 www.senseta.com



can spark take advantage of ordered data?

2015-03-11 Thread Jonathan Coveney
Hello all,

I am wondering if spark already has support for optimizations on sorted
data and/or if such support could be added (I am comfortable dropping to a
lower level if necessary to implement this, but I'm not sure if it is
possible at all).

Context: we have a number of data sets which are essentially already sorted
on a key. With our current systems, we can take advantage of this to do a
lot of analysis in a very efficient fashion...merges and joins, for
example, can be done very efficiently, as can folds on a secondary key and
so on.

I was wondering if spark would be a fit for implementing these sorts of
optimizations? Obviously it is sort of a niche case, but would this be
achievable? Any pointers on where I should look?


SVD transform of large matrix with MLlib

2015-03-11 Thread sergunok
Has anybody used SVD from MLlib for a very large (like 10^6 x 10^7) sparse
matrix?
How long did it take?
What implementation of SVD is used in MLlib?
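
For reference, a minimal sketch of the entry point in question,
RowMatrix.computeSVD; the tiny dense matrix is only to show the call, not a
claim about how a 10^6 x 10^7 sparse matrix will behave:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(0.0, 2.0, 0.0),
  Vectors.dense(4.0, 0.0, 5.0)))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(2, computeU = true)   // keep the top k = 2 singular values
val (u, s, v) = (svd.U, svd.s, svd.V)          // U: RowMatrix, s: Vector, V: local Matrix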



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SVD-transform-of-large-matrix-with-MLlib-tp22005.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark SQL using Hive metastore

2015-03-11 Thread Grandl Robert
Hi guys,
I am a newbie in running Spark SQL / Spark. My goal is to run some TPC-H
queries atop Spark SQL using the Hive metastore.
It looks like the Spark 1.2.1 release has Spark SQL / Hive support. However, I am
not able to fully connect all the dots.

I did the following:
1. Copied hive-site.xml from hive to spark/conf
2. Copied the mysql connector to spark/lib
3. Started the hive metastore service: hive --service metastore
4. Started ./bin/spark-sql
5. Typed: spark-sql> show tables;

However, the following error was thrown:
Job 0 failed: collect at SparkPlan.scala:84, took 0.241788 s
15/03/11 15:02:35 ERROR SparkSQLDriver: Failed in [show tables]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 
serialization failed: org.xerial.snappy.SnappyError: 
[FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux 
and os.arch=aarch64

Do you know what I am doing wrong? I should mention that I have hive-0.14
instead of hive-0.13.

And another question: what is the right command to run SQL queries with Spark
SQL using the Hive metastore?

Thanks,
Robert



RE: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
Hi Tom,

That's an outdated document from 4/5 years ago. 

Spark currently uses a BitTorrent like mechanism that's been tuned for 
datacenter environments. 

Mosharaf

-Original Message-
From: "Tom" 
Sent: ‎3/‎11/‎2015 4:58 PM
To: "user@spark.apache.org" 
Subject: Which strategy is used for broadcast variables?

In "Performance and Scalability of Broadcast in Spark" by Mosharaf Chowdhury
I read that Spark uses HDFS for its broadcast variables. This seems highly
inefficient. In the same paper alternatives are proposed, among which
"Bittorent Broadcast (BTB)". While studying "Learning Spark," page 105,
second paragraph about Broadcast Variables, I read " The value is sent to
each node only once, using an efficient, BitTorrent-like communication
mechanism." 

- Is the book talking about the proposed BTB from the paper? 

- Is this currently the default? 

- If not, what is?

Thanks,

Tom



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Which-strategy-is-used-for-broadcast-variables-tp22004.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Which strategy is used for broadcast variables?

2015-03-11 Thread Tom
In "Performance and Scalability of Broadcast in Spark" by Mosharaf Chowdhury
I read that Spark uses HDFS for its broadcast variables. This seems highly
inefficient. In the same paper alternatives are proposed, among which
"Bittorent Broadcast (BTB)". While studying "Learning Spark," page 105,
second paragraph about Broadcast Variables, I read " The value is sent to
each node only once, using an efficient, BitTorrent-like communication
mechanism." 

- Is the book talking about the proposed BTB from the paper? 

- Is this currently the default? 

- If not, what is?

Thanks,

Tom



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Which-strategy-is-used-for-broadcast-variables-tp22004.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Writing to a single file from multiple executors

2015-03-11 Thread SamyaMaiti
Hi Experts,

I have a scenario wherein I want to write to an Avro file from a streaming
job that reads data from Kafka.

But the issue is that there are multiple executors, and when they all try to
write to a given file I get a concurrency exception.

One way to mitigate the issue is to repartition & have a single writer task,
but as my data is huge that is not a feasible option.

Any suggestions welcomed.

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Writing-to-a-single-file-from-multiple-executors-tp22003.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile(“”)).
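
Not sure if it applies to your exact job, but the pattern usually suggested for
this class of error (SPARK-5063 hit after checkpoint recovery) is to avoid
closing over a driver-side SQLContext and instead create it lazily from the
RDD's own SparkContext inside foreachRDD. A sketch, assuming `stream` carries
case classes (substitute HiveContext if that is what the table needs):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sc: SparkContext): SQLContext = synchronized {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

stream.foreachRDD { rdd =>
  val sqlc = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlc.createSchemaRDD                 // implicit RDD[case class] => SchemaRDD
  rdd.insertInto("table", overwrite = true)   // assumes the elements are case classes
}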

> On 11.03.2015, at 18:35, Marius Soutier  wrote:
> 
> Hi,
> 
> I’ve written a Spark Streaming Job that inserts into a Parquet, using 
> stream.foreachRDD(_insertInto(“table”, overwrite = true). Now I’ve added 
> checkpointing; everything works fine when starting from scratch. When 
> starting from a checkpoint however, the job doesn’t work and produces the 
> following exception in the foreachRDD:
> 
> ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job 
> streaming job 142609383 ms.2
> org.apache.spark.SparkException: RDD transformations and actions can only be 
> invoked by the driver, not inside of other transformations; for example, 
> rdd1.map(x => rdd2.values.count() * x) is invalid because the values 
> transformation and count action cannot be performed inside of the rdd1.map 
> transformation. For more information, see SPARK-5063.
>   at org.apache.spark.rdd.RDD.sc(RDD.scala:90)
>   at org.apache.spark.rdd.RDD.(RDD.scala:143)
>   at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
>   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:114)
>   at MyStreamingJob$$anonfun$createComputation$3.apply(MyStreamingJob:167)
>   at MyStreamingJob$$anonfun$createComputation$3.apply(MyStreamingJob:167)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:535)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:535)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> 
> 
> 
> 
> Cheers
> - Marius
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to read from hdfs using spark-shell in Intel hadoop?

2015-03-11 Thread Arush Kharbanda
You can add resolvers on SBT using

resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"


On Thu, Feb 26, 2015 at 4:09 PM, MEETHU MATHEW 
wrote:

> Hi,
>
> I am not able to read from HDFS(Intel distribution hadoop,Hadoop version
> is 1.0.3) from spark-shell(spark version is 1.2.1). I built spark using the
> command
> mvn -Dhadoop.version=1.0.3 clean package and started  spark-shell and read
> a HDFS file using sc.textFile() and the exception is
>
>  WARN hdfs.DFSClient: Failed to connect to /10.88.6.133:50010, add to
> deadNodes and continuejava.net.SocketTimeoutException: 12 millis
> timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.88.6.131:44264
> remote=/10.88.6.133:50010]
>
> The same problem is asked in the this mail.
>  RE: Spark is unable to read from HDFS
> 
>
>
>
>
>
>
> RE: Spark is unable to read from HDFS
> 
> Hi, Thanks for the reply. I've tried the below.
> View on mail-archives.us.apache.org
> 
> Preview by Yahoo
>
>
> As suggested in the above mail,
> *"In addition to specifying HADOOP_VERSION=1.0.3 in the
> ./project/SparkBuild.scala file, you will need to specify the
> libraryDependencies and name "spark-core"  resolvers. Otherwise, sbt will
> fetch version 1.0.3 of hadoop-core from apache instead of Intel. You can
> set up your own local or remote repository that you specify" *
>
> Now HADOOP_VERSION is deprecated and -Dhadoop.version should be used. Can
> anybody please elaborate on how to specify tat SBT should fetch hadoop-core
> from Intel which is in our internal repository?
>
> Thanks & Regards,
> Meethu M
>



-- 

[image: Sigmoid Analytics] 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-11 Thread Michael Armbrust
That val is not really your problem.  In general, there is a lot of global
state throughout the hive codebase that make it unsafe to try and connect
to more than one hive installation from the same JVM.

On Tue, Mar 10, 2015 at 11:36 PM, Haopu Wang  wrote:

>  Hao, thanks for the response.
>
>
>
> For Q1, in my case, I have a tool on SparkShell which serves multiple
> users where they can use different Hive installation. I take a look at the
> code of HiveContext. It looks like I cannot do that today because "catalog"
> field cannot be changed after initialize.
>
>
>
>   /* A catalyst metadata catalog that points to the Hive Metastore. */
>
>   @transient
>
>   *override* *protected*[sql] *lazy* *val* catalog = *new*
> HiveMetastoreCatalog(*this*) *with* OverrideCatalog
>
>
>
> For Q2, I check HDFS and it is running as a cluster. I can run the DDL
> from spark shell with HiveContext as well. To reproduce the exception, I
> just run below script. It happens in the last step.
>
>
>
> 15/03/11 14:24:48 INFO SparkILoop: Created sql context (with Hive
> support)..
>
> SQL context available as sqlContext.
>
> scala> sqlContext.sql("SET
> hive.metastore.warehouse.dir=hdfs://server:8020/space/warehouse")
>
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src(key INT, value
> STRING)")
>
> scala> sqlContext.sql("LOAD DATA LOCAL INPATH
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> scala> var output = sqlContext.sql("SELECT key,value FROM src")
>
> scala> output.saveAsTable("outputtable")
>
>
>  --
>
> *From:* Cheng, Hao [mailto:hao.ch...@intel.com]
> *Sent:* Wednesday, March 11, 2015 8:25 AM
> *To:* Haopu Wang; user; d...@spark.apache.org
> *Subject:* RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?
>
>
>
> I am not so sure if Hive supports change the metastore after initialized,
> I guess not. Spark SQL totally rely on Hive Metastore in HiveContext,
> probably that’s why it doesn’t work as expected for Q1.
>
>
>
> BTW, in most of cases, people configure the metastore settings in
> hive-site.xml, and will not change that since then, is there any reason
> that you want to change that in runtime?
>
>
>
> For Q2, probably something wrong in configuration, seems the HDFS run into
> the pseudo/single node mode, can you double check that? Or can you run the
> DDL (like create a table) from the spark shell with HiveContext?
>
>
>
> *From:* Haopu Wang [mailto:hw...@qilinsoft.com]
> *Sent:* Tuesday, March 10, 2015 6:38 PM
> *To:* user; d...@spark.apache.org
> *Subject:* [SparkSQL] Reuse HiveContext to different Hive warehouse?
>
>
>
> I'm using Spark 1.3.0 RC3 build with Hive support.
>
>
>
> In Spark Shell, I want to reuse the HiveContext instance to different
> warehouse locations. Below are the steps for my test (Assume I have loaded
> a file into table "src").
>
>
>
> ==
>
> 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive
> support)..
>
> SQL context available as sqlContext.
>
> scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")
>
> scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")
>
> scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")
>
> scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")
>
> ==
>
> After these steps, the tables are stored in "/test/w" only. I expect
> "table2" to be stored in "/test/w2" folder.
>
>
>
> Another question is: if I set "hive.metastore.warehouse.dir" to a HDFS
> folder, I cannot use saveAsTable()? Is this by design? Exception stack
> trace is below:
>
> ==
>
> 15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block
> broadcast_0_piece0
>
> 15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at
> TableReader.scala:74
>
> java.lang.IllegalArgumentException: Wrong FS:
> hdfs://server:8020/space/warehouse/table2, expected: file:///
>
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
>
> at
> org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)
>
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
>
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at scala.collection.immutable.List.foreach(List.scala:318)
>
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>
> at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
>
> at
> org.apache.spark.sql.parquet.ParquetRelation2.(newParquet.scal

Re: Architecture Documentation

2015-03-11 Thread Vijay Saraswat
I've looked at the Zaharia PhD thesis to get an idea of the underlying 
system.


http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.html

On 3/11/15 12:48 PM, Mohit Anchlia wrote:
Is there a good architecture doc that gives a sufficient overview of 
high level and low level details of spark with some good diagrams?



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Hi,

I’ve written a Spark Streaming Job that inserts into a Parquet, using 
stream.foreachRDD(_insertInto(“table”, overwrite = true). Now I’ve added 
checkpointing; everything works fine when starting from scratch. When starting 
from a checkpoint however, the job doesn’t work and produces the following 
exception in the foreachRDD:

ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job 
streaming job 142609383 ms.2
org.apache.spark.SparkException: RDD transformations and actions can only be 
invoked by the driver, not inside of other transformations; for example, 
rdd1.map(x => rdd2.values.count() * x) is invalid because the values 
transformation and count action cannot be performed inside of the rdd1.map 
transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.sc(RDD.scala:90)
at org.apache.spark.rdd.RDD.(RDD.scala:143)
at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:114)
at MyStreamingJob$$anonfun$createComputation$3.apply(MyStreamingJob:167)
at MyStreamingJob$$anonfun$createComputation$3.apply(MyStreamingJob:167)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:535)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:535)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)




Cheers
- Marius


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Scaling problem in RandomForest?

2015-03-11 Thread insperatum
Hi, the Random Forest implementation (1.2.1) is repeatably crashing when I
increase the depth to 20. I generate random synthetic data (36 workers,
1,000,000 examples per worker, 30 features per example) as follows:

val data = sc.parallelize(1 to 36, 36).mapPartitionsWithIndex((i, _) => {
  Array.tabulate(100) { _ =>
    new LabeledPoint(Math.random(), Vectors.dense(Array.fill(30)(math.random)))
  }.toIterator
}).cache()

...and then train on a Random Forest with 50 trees, to depth 20:

val strategy = new Strategy(Regression, Variance, 20, maxMemoryInMB = 1000)
RandomForest.trainRegressor(data, strategy, 50, "sqrt", 1)

...and run on my EC2 cluster (36 slaves, master has 122GB of memory). After
number crunching for a couple of hours, I get the following error:

[sparkDriver-akka.actor.default-dispatcher-3] shutting down ActorSystem
[sparkDriver]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
at
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/03/11 15:45:51 INFO scheduler.DAGScheduler: Job 92 failed: collectAsMap
at DecisionTree.scala:653, took 46.062487 s



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scaling-problem-in-RandomForest-tp22002.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Architecture Documentation

2015-03-11 Thread Mohit Anchlia
Is there a good architecture doc that gives a sufficient overview of high
level and low level details of spark with some good diagrams?


Re: PairRDD serialization exception

2015-03-11 Thread Manas Kar
Hi Sean,
Below is the sbt dependencies that I am using.

I gave it another try by removing the "provided" keyword, which failed with the
same error.
What confuses me is that the stack trace appears after a few of the stages
have already run completely.

  object V {
val spark = "1.2.0-cdh5.3.0"
val esriGeometryAPI = "1.2"
val csvWriter = "1.0.0"
val hadoopClient = "2.5.0"
val scalaTest = "2.2.1"
val jodaTime = "1.6.0"
val scalajHTTP = "1.0.1"
val avro   = "1.7.7"
val scopt  = "3.2.0"
val breeze = "0.8.1"
val config = "1.2.1"
  }
  object Libraries {
val EEAMessage  = "com.waterloopublic" %% "eeaformat" %
"1.0-SNAPSHOT"
val avro= "org.apache.avro" % "avro-mapred" %
V.avro classifier "hadoop2"
val spark  = "org.apache.spark" % "spark-core_2.10" %
V.spark % "provided"
val hadoopClient= "org.apache.hadoop" % "hadoop-client" %
V.hadoopClient % "provided"
val esriGeometryAPI  = "com.esri.geometry" % "esri-geometry-api" %
V.esriGeometryAPI
val scalaTest = "org.scalatest" %% "scalatest" %
V.scalaTest % "test"
val csvWriter  = "com.github.tototoshi" %% "scala-csv" %
V.csvWriter
val jodaTime   = "com.github.nscala-time" %% "nscala-time"
% V.jodaTime % "provided"
val scalajHTTP= "org.scalaj" %% "scalaj-http" % V.scalajHTTP
val scopt= "com.github.scopt" %% "scopt" % V.scopt
val breeze  = "org.scalanlp" %% "breeze" % V.breeze
val breezeNatives   = "org.scalanlp" %% "breeze-natives" % V.breeze
val config  = "com.typesafe" % "config" % V.config
  }

There are only a few more things to try (like reverting back to Spark 1.1)
before I run out of ideas completely.
Please share your insights.

..Manas

On Wed, Mar 11, 2015 at 9:44 AM, Sean Owen  wrote:

> This usually means you are mixing different versions of code. Here it
> is complaining about a Spark class. Are you sure you built vs the
> exact same Spark binaries, and are not including them in your app?
>
> On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar
>  wrote:
> > (This is a repost. May be a simpler subject will fetch more attention
> among
> > experts)
> >
> > Hi,
> >  I have a CDH5.3.2(Spark1.2) cluster.
> >  I am getting an local class incompatible exception for my spark
> application
> > during an action.
> > All my classes are case classes(To best of my knowledge)
> >
> > Appreciate any help.
> >
> > Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due
> > to stage failure: Task 0 in stage 3.0 failed 4 times, most recent
> failure:
> > Lost task 0.3 in stage 3.0 (TID 346, datanode02):
> > java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions;
> local
> > class incompatible:stream classdesc serialVersionUID =
> 8789839749593513237,
> > local class serialVersionUID = -4145741279224749316
> > at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> > at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> > at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> > at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> > at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> > at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> > at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> > at
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> > at
> >
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
> > at org.apache.spark.scheduler.Task.run(Task.scala:56)
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:745)
> >
> >
> > Thanks
> > Manas
> > Manas Kar
> >
> > 
> > View this message in context: PairRDD serialization exception
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Read parquet folders recursively

2015-03-11 Thread Masf
Hi all

Is it possible to read recursively folders to read parquet files?


Thanks.

-- 


Saludos.
Miguel Ángel


Re: Joining data using Latitude, Longitude

2015-03-11 Thread Ankur Srivastava
Thank you everyone!! I have started implementing the join using the geohash,
using the first 4 characters of the hash as the key.

Can I assign a confidence factor in terms of distance based on the number of
matching characters in the hash?

I will also look at the other options listed here.
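
If it helps, a minimal sketch of turning the matching-prefix idea into a score,
assuming both geohashes were computed at the same precision. A longer shared
prefix generally means closer, but nearby points on opposite sides of a cell
boundary can still share a short prefix, so treat it as a rough confidence
rather than a distance:

// length of the shared prefix of two equal-precision geohashes
def sharedPrefixLen(a: String, b: String): Int =
  a.zip(b).takeWhile { case (x, y) => x == y }.size

val score = sharedPrefixLen("9q8yyk8", "9q8yyjx")   // hypothetical geohashes => 5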

Thanks
Ankur

On Wed, Mar 11, 2015, 6:18 AM Manas Kar  wrote:

> There are few techniques currently available.
> Geomesa which uses GeoHash also can be proved useful.(
> https://github.com/locationtech/geomesa)
>
> Other potential candidate is
> https://github.com/Esri/gis-tools-for-hadoop especially
> https://github.com/Esri/geometry-api-java for inner customization.
>
> If you want to ask questions like nearby me then these are the basic steps.
> 1) Index your geometry data which uses R-Tree.
> 2) Write your joiner logic that takes advantage of the index tree to get
> you faster access.
>
> Thanks
> Manas
>
>
> On Wed, Mar 11, 2015 at 5:55 AM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
>
>> Ted Dunning and Ellen Friedman's "Time Series Databases" has a section on
>> this with some approaches to geo-encoding:
>>
>> https://www.mapr.com/time-series-databases-new-ways-store-and-access-data
>> http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf
>>
>> On Tue, Mar 10, 2015 at 3:53 PM, John Meehan  wrote:
>>
>>> There are some techniques you can use If you geohash
>>>  the lat-lngs.  They will
>>> naturally be sorted by proximity (with some edge cases so watch out).  If
>>> you go the join route, either by trimming the lat-lngs or geohashing them,
>>> you’re essentially grouping nearby locations into buckets — but you have to
>>> consider the borders of the buckets since the nearest location may actually
>>> be in an adjacent bucket.  Here’s a paper that discusses an implementation:
>>> http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf
>>>
>>> On Mar 9, 2015, at 11:42 PM, Akhil Das 
>>> wrote:
>>>
>>> Are you using SparkSQL for the join? In that case I'm not quiet sure you
>>> have a lot of options to join on the nearest co-ordinate. If you are using
>>> the normal Spark code (by creating key-pair on lat,lon) you can apply
>>> certain logic like trimming the lat,lon etc. If you want more specific
>>> computing then you are better off using haversine formula.
>>> 
>>>
>>>
>>>
>>
>


bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available

2015-03-11 Thread Patcharee Thongtra

Hi,

I have built spark version 1.3 and tried to use it in my spark scala
application. When I tried to compile and build the application with SBT, I
got this error:
bad symbolic reference. A signature in SparkContext.class refers to term
conf in value org.apache.hadoop which is not available


It seems the hadoop library is missing, but it should be pulled in
automatically by SBT, shouldn't it?


This application is buildable on spark version 1.2.

Here is my build.sbt

name := "wind25t-v013"
version := "0.1"
scalaVersion := "2.10.4"
unmanagedBase := baseDirectory.value / "lib"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0"
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.3.0"

What should I do to fix it?

BR,
Patcharee




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SVM questions (data splitting, SVM parameters)

2015-03-11 Thread Sean Owen
Jackknife / leave-one-out is just a special case of bootstrapping. In
general I don't think you'd ever do it this way with large data, as
you are creating N-1 models, which could be billions. It's suitable
when you have very small data.

So I suppose I'd say the general bootstrapping you see in this example
is what you want anyway, if you're using Spark.

That said, you could filter() out a particular element of the RDD,
train, compare, and repeat, if you really wanted to. It's just a few
more lines of code over what you see here.

You can repeat the model build / eval process for using simple grid
search, sure. Just:

for (
kernel <- ...;
cost <- ...;
gamma <- ...) yield {

   // build and eval model

   ((kernel, cost, gamma), eval)
}

and that gives you all evaluations for all combinations of hyperparams.
It's brute-force but certainly simple and easy.
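
(For what it's worth, MLlib's SVMWithSGD is a linear SVM, so in practice the
hyperparameters are things like regParam, numIterations and stepSize rather
than libsvm-style kernel/cost/gamma.) A rough sketch of that loop with those
parameters; `training`, `test` and `evaluate` are placeholders you'd supply:

import org.apache.spark.mllib.classification.SVMWithSGD

val grid = for {
  numIter  <- Seq(100, 200)
  stepSize <- Seq(0.1, 1.0)
  regParam <- Seq(0.01, 0.1, 1.0)
} yield {
  val model = SVMWithSGD.train(training, numIter, stepSize, regParam, 1.0)
  ((numIter, stepSize, regParam), evaluate(model, test))   // evaluate(...) is assumed
}
val best = grid.maxBy(_._2)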

On Wed, Mar 11, 2015 at 3:18 PM, Natalia Connolly
 wrote:
> Hello,
>
>I am new to Spark and I am evaluating its suitability for my machine
> learning tasks.  I am using Spark v. 1.2.1.  I would really appreciate if
> someone could provide any insight about the following two issues.
>
>  1.  I'd like to try a "leave one out" approach for training my SVM, meaning
> that all but one data points are used for training.  The example SVM
> classifier code on the Spark webpage has this:
>
> JavaRDD data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
>
> JavaRDD training = data.sample(false, 0.6, 11L);
> training.cache();
> JavaRDD test = data.subtract(training);
>
>   Is there a way to iterate over data and progressively remove each element
> in order to designate the rest of the dataset as training, instead of using
> a certain fraction of all the data for training (60% in the above example)?
>
> 2.  Is there a way to choose and vary the parameters of the SVM?  (kernel,
> cost, gamma…)
>
> Thank you!
>
> Natalia

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SVM questions (data splitting, SVM parameters)

2015-03-11 Thread Natalia Connolly
Hello,

   I am new to Spark and I am evaluating its suitability for my machine
learning tasks.  I am using Spark v. 1.2.1.  I would really appreciate if
someone could provide any insight about the following two issues.

 1.  I'd like to try a "leave one out" approach for training my SVM,
meaning that all but one data points are used for training.  The example
SVM classifier code on the Spark webpage has this:

JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();

JavaRDD<LabeledPoint> training = data.sample(false, 0.6, 11L);
training.cache();
JavaRDD<LabeledPoint> test = data.subtract(training);

  Is there a way to iterate over data and progressively remove each element
in order to designate the rest of the dataset as training, instead of using
a certain fraction of all the data for training (60% in the above example)?


2.  Is there a way to choose and vary the parameters of the SVM?  (kernel,
cost, gamma…)

Thank you!

Natalia


Re: Define exception handling on lazy elements?

2015-03-11 Thread Sean Owen
Hm, but you already only have to define it in one place, rather than
on each transformation. I thought you wanted exception handling at
each transformation?

Or do you want it once for all actions? you can enclose all actions in
a try-catch block, I suppose, to write exception handling code once.
You can easily write a Scala construct that takes a function and logs
exceptions it throws, and the function you pass can invoke an RDD
action. So you can refactor that way too.
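
Something like this minimal sketch of that construct (the logging here is just
println; swap in whatever you actually use):

import scala.util.{Failure, Success, Try}

def withErrorHandling[T](label: String)(body: => T): Option[T] =
  Try(body) match {
    case Success(v) => Some(v)
    case Failure(e) =>
      println(s"[$label] failed: ${e.getMessage}")   // or log / alert / rethrow
      None
  }

// the handling is written once and reused around each action, e.g.:
// val counts = withErrorHandling("wordCount")(myRDD.reduceByKey(_ + _).collect())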

On Wed, Mar 11, 2015 at 2:39 PM, Michal Klos  wrote:
> Is there a way to have the exception handling go lazily along with the
> definition?
>
> e.g... we define it on the RDD but then our exception handling code gets
> triggered on that first action... without us having to define it on the
> first action? (e.g. that RDD code is boilerplate and we want to just have it
> in many many projects)
>
> m
>
> On Wed, Mar 11, 2015 at 10:08 AM, Sean Owen  wrote:
>>
>> Handling exceptions this way means handling errors on the driver side,
>> which may or may not be what you want. You can also write functions
>> with exception handling inside, which could make more sense in some
>> cases (like, to ignore bad records or count them or something).
>>
>> If you want to handle errors at every step on the driver side, you
>> have to force RDDs to materialize to see if they "work". You can do
>> that with .count() or .take(1).length > 0. But to avoid recomputing
>> the RDD then, it needs to be cached. So there is a big non-trivial
>> overhead to approaching it this way.
>>
>> If you go this way, consider materializing only a few key RDDs in your
>> flow, not every one.
>>
>> The most natural thing is indeed to handle exceptions where the action
>> occurs.
>>
>>
>> On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos 
>> wrote:
>> > Hi Spark Community,
>> >
>> > We would like to define exception handling behavior on RDD instantiation
>> > /
>> > build. Since the RDD is lazily evaluated, it seems like we are forced to
>> > put
>> > all exception handling in the first action call?
>> >
>> > This is an example of something that would be nice:
>> >
>> > def myRDD = {
>> > Try {
>> > val rdd = sc.textFile(...)
>> > } match {
>> > Failure(e) => Handle ...
>> > }
>> > }
>> >
>> > myRDD.reduceByKey(...) //don't need to worry about that exception here
>> >
>> > The reason being that we want to try to avoid having to copy paste
>> > exception
>> > handling boilerplate on every first action. We would love to define this
>> > once somewhere for the RDD build code and just re-use.
>> >
>> > Is there a best practice for this? Are we missing something here?
>> >
>> > thanks,
>> > Michal
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Well I'm thinking that this RDD would fail to build in a specific way...
 different from the subsequent code (e.g. s3 access denied or timeout on
connecting to a database)

So for example, define the RDD failure handling on the RDD, define the
action failure handling on the action? Does this make sense.. otherwise...
on that first action, we have to handle exceptions for all of the lazy
elements that preceded it.. and that could be a lot of stuff.

If the RDD failure handling code is defined with the RDD, it just seems
cleaner because it's right next to its element. Not to mention, we believe
it would be easier to import it into multiple spark jobs without a lot of
copy pasta

m

On Wed, Mar 11, 2015 at 10:45 AM, Sean Owen  wrote:

> Hm, but you already only have to define it in one place, rather than
> on each transformation. I thought you wanted exception handling at
> each transformation?
>
> Or do you want it once for all actions? you can enclose all actions in
> a try-catch block, I suppose, to write exception handling code once.
> You can easily write a Scala construct that takes a function and logs
> exceptions it throws, and the function you pass can invoke an RDD
> action. So you can refactor that way too.
>
> On Wed, Mar 11, 2015 at 2:39 PM, Michal Klos 
> wrote:
> > Is there a way to have the exception handling go lazily along with the
> > definition?
> >
> > e.g... we define it on the RDD but then our exception handling code gets
> > triggered on that first action... without us having to define it on the
> > first action? (e.g. that RDD code is boilerplate and we want to just
> have it
> > in many many projects)
> >
> > m
> >
> > On Wed, Mar 11, 2015 at 10:08 AM, Sean Owen  wrote:
> >>
> >> Handling exceptions this way means handling errors on the driver side,
> >> which may or may not be what you want. You can also write functions
> >> with exception handling inside, which could make more sense in some
> >> cases (like, to ignore bad records or count them or something).
> >>
> >> If you want to handle errors at every step on the driver side, you
> >> have to force RDDs to materialize to see if they "work". You can do
> >> that with .count() or .take(1).length > 0. But to avoid recomputing
> >> the RDD then, it needs to be cached. So there is a big non-trivial
> >> overhead to approaching it this way.
> >>
> >> If you go this way, consider materializing only a few key RDDs in your
> >> flow, not every one.
> >>
> >> The most natural thing is indeed to handle exceptions where the action
> >> occurs.
> >>
> >>
> >> On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos 
> >> wrote:
> >> > Hi Spark Community,
> >> >
> >> > We would like to define exception handling behavior on RDD
> instantiation
> >> > /
> >> > build. Since the RDD is lazily evaluated, it seems like we are forced
> to
> >> > put
> >> > all exception handling in the first action call?
> >> >
> >> > This is an example of something that would be nice:
> >> >
> >> > def myRDD = {
> >> > Try {
> >> > val rdd = sc.textFile(...)
> >> > } match {
> >> > Failure(e) => Handle ...
> >> > }
> >> > }
> >> >
> >> > myRDD.reduceByKey(...) //don't need to worry about that exception here
> >> >
> >> > The reason being that we want to try to avoid having to copy paste
> >> > exception
> >> > handling boilerplate on every first action. We would love to define
> this
> >> > once somewhere for the RDD build code and just re-use.
> >> >
> >> > Is there a best practice for this? Are we missing something here?
> >> >
> >> > thanks,
> >> > Michal
> >
> >
>


Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Sorry, typo; it should be https://github.com/intel-spark/stream-sql

Thanks,
-Jason

On Wed, Mar 11, 2015 at 10:19 PM, Irfan Ahmad 
wrote:

> Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql
>
>
> *Irfan Ahmad*
> CTO | Co-Founder | *CloudPhysics* 
> Best of VMworld Finalist
> Best Cloud Management Award
> NetworkWorld 10 Startups to Watch
> EMA Most Notable Vendor
>
> On Wed, Mar 11, 2015 at 6:41 AM, Jason Dai  wrote:
>
>> Yes, a previous prototype is available
>> https://github.com/Intel-bigdata/spark-streamsql, and a talk is given at
>> last year's Spark Summit (
>> http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark
>> )
>>
>> We are currently porting the prototype to use the latest DataFrame API,
>> and will provide a stable version for people to try soon.
>>
>> Thanks,
>> -Jason
>>
>>
>> On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer 
>> wrote:
>>
>>> Hi,
>>>
>>> On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao  wrote:
>>>
  Intel has a prototype for doing this, SaiSai and Jason are the
 authors. Probably you can ask them for some materials.

>>>
>>> The github repository is here: https://github.com/intel-spark/stream-sql
>>>
>>> Also, what I did is writing a wrapper class SchemaDStream that
>>> internally holds a DStream[Row] and a DStream[StructType] (the latter
>>> having just one element in every RDD) and then allows to do
>>> - operations SchemaRDD => SchemaRDD using
>>> `rowStream.transformWith(schemaStream, ...)`
>>> - in particular you can register this stream's data as a table this way
>>> - and via a companion object with a method `fromSQL(sql: String):
>>> SchemaDStream` you can get a new stream from previously registered tables.
>>>
>>> However, you are limited to batch-internal operations, i.e., you can't
>>> aggregate across batches.
>>>
>>> I am not able to share the code at the moment, but will within the next
>>> months. It is not very advanced code, though, and should be easy to
>>> replicate. Also, I have no idea about the performance of transformWith
>>>
>>> Tobias
>>>
>>>
>>
>


Re: GraphX Snapshot Partitioning

2015-03-11 Thread Matthew Bucci
Hi,

Thanks for the response! That answered some questions I had, but the last
thing I was wondering is what happens if you run a partition strategy and one
of the partitions ends up being too large. For example, let's say
partitions can hold 64MB (actually, knowing the maximum possible size of a
partition would probably also be helpful to me). You try to partition the
edges of a graph into 3 separate partitions, but the edges in the first
partition end up being 80MB worth of edges, so they cannot all fit in the
first partition. Would the extra 16MB flood over into a new 4th partition,
would the system try to split it so that the 1st and 4th partitions are
both at 40MB, or would the partition strategy just fail with a memory
error?

Thank You,
Matthew Bucci

On Mon, Mar 9, 2015 at 11:07 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Vertices are simply hash-paritioned by their 64-bit IDs, so
> they are evenly spread over parititons.
>
> As for edges, GraphLoader#edgeList builds edge paritions
> through hadoopFile(), so the initial parititons depend
> on InputFormat#getSplits implementations
> (e.g, partitions are mostly equal to 64MB blocks for HDFS).
>
> Edges can be re-partitioned by ParititonStrategy;
> a graph is partitioned considering graph structures and
> a source ID and a destination ID are used as partition keys.
> The partitions might suffer from skewness depending
> on graph properties (hub nodes, or something).
>
> Thanks,
> takeshi
>
>
> On Tue, Mar 10, 2015 at 2:21 AM, Matthew Bucci 
> wrote:
>
>> Hello,
>>
>> I am working on a project where we want to split graphs of data into
>> snapshots across partitions and I was wondering what would happen if one
>> of
>> the snapshots we had was too large to fit into a single partition. Would
>> the
>> snapshot be split over the two partitions equally, for example, and how
>> is a
>> single snapshot spread over multiple partitions?
>>
>> Thank You,
>> Matthew Bucci
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Snapshot-Partitioning-tp21977.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Is there a way to have the exception handling go lazily along with the
definition?

e.g... we define it on the RDD but then our exception handling code gets
triggered on that first action... without us having to define it on the
first action? (e.g. that RDD code is boilerplate and we want to just have
it in many many projects)

m

On Wed, Mar 11, 2015 at 10:08 AM, Sean Owen  wrote:

> Handling exceptions this way means handling errors on the driver side,
> which may or may not be what you want. You can also write functions
> with exception handling inside, which could make more sense in some
> cases (like, to ignore bad records or count them or something).
>
> If you want to handle errors at every step on the driver side, you
> have to force RDDs to materialize to see if they "work". You can do
> that with .count() or .take(1).length > 0. But to avoid recomputing
> the RDD then, it needs to be cached. So there is a big non-trivial
> overhead to approaching it this way.
>
> If you go this way, consider materializing only a few key RDDs in your
> flow, not every one.
>
> The most natural thing is indeed to handle exceptions where the action
> occurs.
>
>
> On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos 
> wrote:
> > Hi Spark Community,
> >
> > We would like to define exception handling behavior on RDD instantiation
> /
> > build. Since the RDD is lazily evaluated, it seems like we are forced to
> put
> > all exception handling in the first action call?
> >
> > This is an example of something that would be nice:
> >
> > def myRDD = {
> > Try {
> > val rdd = sc.textFile(...)
> > } match {
> > Failure(e) => Handle ...
> > }
> > }
> >
> > myRDD.reduceByKey(...) //don't need to worry about that exception here
> >
> > The reason being that we want to try to avoid having to copy paste
> exception
> > handling boilerplate on every first action. We would love to define this
> > once somewhere for the RDD build code and just re-use.
> >
> > Is there a best practice for this? Are we missing something here?
> >
> > thanks,
> > Michal
>


Re: SQL with Spark Streaming

2015-03-11 Thread Irfan Ahmad
Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* 
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 11, 2015 at 6:41 AM, Jason Dai  wrote:

> Yes, a previous prototype is available
> https://github.com/Intel-bigdata/spark-streamsql, and a talk is given at
> last year's Spark Summit (
> http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark
> )
>
> We are currently porting the prototype to use the latest DataFrame API,
> and will provide a stable version for people to try soon.
>
> Thanks,
> -Jason
>
>
> On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer  wrote:
>
>> Hi,
>>
>> On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao  wrote:
>>
>>>  Intel has a prototype for doing this, SaiSai and Jason are the
>>> authors. Probably you can ask them for some materials.
>>>
>>
>> The github repository is here: https://github.com/intel-spark/stream-sql
>>
>> Also, what I did is writing a wrapper class SchemaDStream that internally
>> holds a DStream[Row] and a DStream[StructType] (the latter having just one
>> element in every RDD) and then allows to do
>> - operations SchemaRDD => SchemaRDD using
>> `rowStream.transformWith(schemaStream, ...)`
>> - in particular you can register this stream's data as a table this way
>> - and via a companion object with a method `fromSQL(sql: String):
>> SchemaDStream` you can get a new stream from previously registered tables.
>>
>> However, you are limited to batch-internal operations, i.e., you can't
>> aggregate across batches.
>>
>> I am not able to share the code at the moment, but will within the next
>> months. It is not very advanced code, though, and should be easy to
>> replicate. Also, I have no idea about the performance of transformWith
>>
>> Tobias
>>
>>
>


RE: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Wang, Ningjun (LNG-NPV)
Thanks for the suggestion. I will try that.

Ningjun


From: Silvio Fiorito [mailto:silvio.fior...@granturing.com]
Sent: Wednesday, March 11, 2015 12:40 AM
To: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: Is it possible to use windows service to start and stop spark 
standalone cluster

Have you tried Apache Daemon? 
http://commons.apache.org/proper/commons-daemon/procrun.html

From: , "Ningjun (LNG-NPV)"
Date: Tuesday, March 10, 2015 at 11:47 PM
To: "user@spark.apache.org"
Subject: Is it possible to use windows service to start and stop spark 
standalone cluster

We are using spark stand alone cluster on Windows 2008 R2. I can start spark 
clusters by open an command prompt and run the following

bin\spark-class.cmd org.apache.spark.deploy.master.Master

bin\spark-class.cmd org.apache.spark.deploy.worker.Worker 
spark://mywin.mydomain.com:7077

I can stop spark cluster by pressing Ctril-C.

The problem is that if the machine is reboot, I have to manually start the 
spark cluster again as above. Is it possible to use a windows service to start 
cluster? This way when the machine is reboot, the windows service will 
automatically restart spark cluster. How to stop spark cluster using windows 
service is also a challenge.

Please advise.

Thanks

Ningjun


Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Cesar Flores
Hi:

Thanks for both answers. One final question: *isn't this registerTempTable an
extra step that the SQL queries need, which may decrease performance compared
to the language-integrated method calls?* The thing is that I am planning to
use them in the current version of the ML Pipeline transformer classes for
feature extraction, and if I need to save the input and maybe the output
SchemaRDD of the transform function in every transformer, this may not be
very efficient.


Thanks

On Tue, Mar 10, 2015 at 8:20 PM, Tobias Pfeiffer  wrote:

> Hi,
>
> On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores  wrote:
>
>> I am new to the SchemaRDD class, and I am trying to decide in using SQL
>> queries or Language Integrated Queries (
>> https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
>> ).
>>
>> Can someone tell me what is the main difference between the two
>> approaches, besides using different syntax? Are they interchangeable? Which
>> one has better performance?
>>
>
> One difference is that the language integrated queries are method calls on
> the SchemaRDD you want to work on, which requires you have access to the
> object at hand. The SQL queries are passed to a method of the SQLContext
> and you have to call registerTempTable() on the SchemaRDD you want to use
> beforehand, which can basically happen at an arbitrary location of your
> program. (I don't know if I could express what I wanted to say.) That may
> have an influence on how you design your program and how the different
> parts work together.
>
> Tobias
>



-- 
Cesar Flores


Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Hi Spark Community,

We would like to define exception handling behavior on RDD instantiation /
build. Since the RDD is lazily evaluated, it seems like we are forced to
put all exception handling in the first action call?

This is an example of something that would be nice:

def myRDD = Try {
  sc.textFile(...)
} match {
  case Success(rdd) => rdd
  case Failure(e) => // handle ...
}

myRDD.reduceByKey(...) //don't need to worry about that exception here

The reason being that we want to try to avoid having to copy paste
exception handling boilerplate on every first action. We would love to
define this once somewhere for the RDD build code and just re-use.

Is there a best practice for this? Are we missing something here?

thanks,
Michal
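
One way to centralise that boilerplate is a small helper that wraps the *action* rather
than the RDD definition, since with lazy RDDs the I/O and parse errors typically only
surface at the first action. A minimal sketch (my own, not an established Spark API):

  import scala.util.{Failure, Success, Try}

  object SafeActions {
    // the try/match boilerplate lives here once; onError decides what a failure maps to
    def withErrorHandling[T](onError: Throwable => T)(body: => T): T =
      Try(body) match {
        case Success(result) => result
        case Failure(e)      => onError(e)
      }
  }

  // usage (names are illustrative): the handler is defined once and reused per first action
  // val total = SafeActions.withErrorHandling[Long] { e => println(s"load failed: $e"); 0L } {
  //   sc.textFile("hdfs:///data/input").map(_.length.toLong).reduce(_ + _)
  // }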


Re: PairRDD serialization exception

2015-03-11 Thread Sean Owen
This usually means you are mixing different versions of code. Here it
is complaining about a Spark class. Are you sure you built vs the
exact same Spark binaries, and are not including them in your app?
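
A minimal sbt sketch of what that usually means in practice, assuming the app is built with
sbt: mark the Spark artifacts as "provided" so the assembly jar ships only your own code and
the cluster's Spark 1.2 (CDH 5.3.2) classes are the ones used at runtime.

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
  )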

On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar
 wrote:
> (This is a repost. May be a simpler subject will fetch more attention among
> experts)
>
> Hi,
>  I have a CDH5.3.2(Spark1.2) cluster.
>  I am getting an local class incompatible exception for my spark application
> during an action.
> All my classes are case classes(To best of my knowledge)
>
> Appreciate any help.
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure:
> Lost task 0.3 in stage 3.0 (TID 346, datanode02):
> java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local
> class incompatible:stream classdesc serialVersionUID = 8789839749593513237,
> local class serialVersionUID = -4145741279224749316
> at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Thanks
> Manas
> Manas Kar
>
> 
> View this message in context: PairRDD serialization exception
> Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Writing wide parquet file in Spark SQL

2015-03-11 Thread Ravindra
I am also keen to learn the answer to this, but as an alternative you can use
Hive to create a table "stored as parquet" and then use it in Spark.
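
A rough sketch of that alternative, assuming a Hive-enabled Spark build; the table and
column names are illustrative (the real DDL would declare the ~100K columns), and
wideSchemaRDD stands in for the SchemaRDD you would otherwise call saveAsParquetFile on;
it needs to come from this same HiveContext for the temp table to be visible.

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)

  hiveContext.sql(
    "CREATE TABLE IF NOT EXISTS wide_table (id STRING, c1 DOUBLE, c2 DOUBLE) STORED AS PARQUET")

  wideSchemaRDD.registerTempTable("wide_staging")
  hiveContext.sql("INSERT OVERWRITE TABLE wide_table SELECT * FROM wide_staging")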

On Wed, Mar 11, 2015 at 1:44 AM kpeng1  wrote:

> Hi All,
>
> I am currently trying to write a very wide file into parquet using spark
> sql.  I have 100K column records that I am trying to write out, but of
> course I am running into space issues(out of memory - heap space).  I was
> wondering if there are any tweaks or work arounds for this.
>
> I am basically calling saveAsParquetFile on the schemaRDD.
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Writing-wide-parquet-file-in-Spark-SQL-tp21995.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Yes, a previous prototype is available
https://github.com/Intel-bigdata/spark-streamsql, and a talk is given at
last year's Spark Summit (
http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark
)

We are currently porting the prototype to use the latest DataFrame API, and
will provide a stable version for people to try soon.

Thanks,
-Jason


On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer  wrote:

> Hi,
>
> On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao  wrote:
>
>>  Intel has a prototype for doing this, SaiSai and Jason are the authors.
>> Probably you can ask them for some materials.
>>
>
> The github repository is here: https://github.com/intel-spark/stream-sql
>
> Also, what I did is writing a wrapper class SchemaDStream that internally
> holds a DStream[Row] and a DStream[StructType] (the latter having just one
> element in every RDD) and then allows to do
> - operations SchemaRDD => SchemaRDD using
> `rowStream.transformWith(schemaStream, ...)`
> - in particular you can register this stream's data as a table this way
> - and via a companion object with a method `fromSQL(sql: String):
> SchemaDStream` you can get a new stream from previously registered tables.
>
> However, you are limited to batch-internal operations, i.e., you can't
> aggregate across batches.
>
> I am not able to share the code at the moment, but will within the next
> months. It is not very advanced code, though, and should be easy to
> replicate. Also, I have no idea about the performance of transformWith
>
> Tobias
>
>
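
For the batch-internal case Tobias describes, a minimal per-batch sketch (assuming Spark
1.2-era APIs and an illustrative one-column schema) looks roughly like this; each batch is
registered and queried on its own, so nothing aggregates across batches:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql._
  import org.apache.spark.streaming.dstream.DStream

  def runBatchSql(sqlContext: SQLContext, rows: DStream[Row]): Unit = {
    val schema = StructType(Seq(StructField("word", StringType, nullable = true)))
    rows.foreachRDD { rdd: RDD[Row] =>
      val batch: SchemaRDD = sqlContext.applySchema(rdd, schema)
      batch.registerTempTable("words")   // re-registered for every batch
      val counts = sqlContext.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")
      counts.collect().foreach(println)  // batch-internal result only
    }
  }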


PairRDD serialization exception

2015-03-11 Thread manasdebashiskar
(This is a repost. May be a simpler subject will fetch more attention among
experts)

Hi,
 I have a CDH5.3.2(Spark1.2) cluster.
 I am getting an local class incompatible exception for my spark
application during an action.
All my classes are case classes(To best of my knowledge)

Appreciate any help.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 3.0 (TID 346, datanode02):
java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local
class incompatible:stream classdesc serialVersionUID = 8789839749593513237,
local class serialVersionUID = -4145741279224749316
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Thanks
Manas




-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PairRDD-serialization-exception-tp21999.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: skewed outer join with spark 1.2.0 - memory consumption

2015-03-11 Thread Marcin Cylke
On Wed, 11 Mar 2015 11:19:56 +0100
Marcin Cylke  wrote:

> Hi
> 
> I'm trying to do a join of two datasets: 800GB with ~50MB.

The job finishes if I set spark.yarn.executor.memoryOverhead to 2048MB.
If it is around 1000MB it fails with "executor lost" errors.

My spark settings are:

- executor cores - 8
- num executors - 32
- executor memory - 4g

Regards
Marcin




Re: Spark fpg large basket

2015-03-11 Thread Sean Barzilay
My min support is low, and after filling out all my space I am applying a
filter on the results to get only the item sets that interest me.

On Wed, 11 Mar 2015 1:58 pm Sean Owen  wrote:

> Have you looked at how big your output is? for example, if your min
> support is very low, you will output a massive volume of frequent item
> sets. If that's the case, then it may be expected that it's taking
> ages to write terabytes of data.
>
> On Wed, Mar 11, 2015 at 8:34 AM, Sean Barzilay 
> wrote:
> > The program spends its time when I am writing the output to a text file
> and
> > I am using 70 partitions
> >
> >
> > On Wed, 11 Mar 2015 9:55 am Sean Owen  wrote:
> >>
> >> I don't think there is enough information here. Where is the program
> >> spending its time? where does it "stop"? how many partitions are
> >> there?
> >>
> >> On Wed, Mar 11, 2015 at 7:10 AM, Akhil Das 
> >> wrote:
> >> > You need to set spark.cores.max to a number say 16, so that on all 4
> >> > machines the tasks will get distributed evenly, Another thing would be
> >> > to
> >> > set spark.default.parallelism if you haven't tried already.
> >> >
> >> > Thanks
> >> > Best Regards
> >> >
> >> > On Wed, Mar 11, 2015 at 12:27 PM, Sean Barzilay <
> sesnbarzi...@gmail.com>
> >> > wrote:
> >> >>
> >> >> I am running on a 4 workers cluster each having between 16 to 30
> cores
> >> >> and
> >> >> 50 GB of ram
> >> >>
> >> >>
> >> >> On Wed, 11 Mar 2015 8:55 am Akhil Das 
> >> >> wrote:
> >> >>>
> >> >>> Depending on your cluster setup (cores, memory), you need to specify
> >> >>> the
> >> >>> parallelism/repartition the data.
> >> >>>
> >> >>> Thanks
> >> >>> Best Regards
> >> >>>
> >> >>> On Wed, Mar 11, 2015 at 12:18 PM, Sean Barzilay
> >> >>> 
> >> >>> wrote:
> >> 
> >>  Hi I am currently using spark 1.3.0-snapshot to run the fpg
> algorithm
> >>  from the mllib library. When I am trying to run the algorithm over
> a
> >>  large
> >>  basket(over 1000 items) the program seems to never finish. Did
> anyone
> >>  find a
> >>  workaround for this problem?
> >> >>>
> >> >>>
> >> >
>


Re: Joining data using Latitude, Longitude

2015-03-11 Thread Manas Kar
There are a few techniques currently available.
GeoMesa, which uses GeoHash, can also prove useful
(https://github.com/locationtech/geomesa).

Another potential candidate is
https://github.com/Esri/gis-tools-for-hadoop, especially
https://github.com/Esri/geometry-api-java for customizing the internals.

If you want to ask questions like "what is near me?", these are the basic steps:
1) Index your geometry data using an R-Tree.
2) Write your join logic so that it takes advantage of the index tree to get
faster access.

Thanks
Manas


On Wed, Mar 11, 2015 at 5:55 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Ted Dunning and Ellen Friedman's "Time Series Databases" has a section on
> this with some approaches to geo-encoding:
>
> https://www.mapr.com/time-series-databases-new-ways-store-and-access-data
> http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf
>
> On Tue, Mar 10, 2015 at 3:53 PM, John Meehan  wrote:
>
>> There are some techniques you can use If you geohash
>>  the lat-lngs.  They will
>> naturally be sorted by proximity (with some edge cases so watch out).  If
>> you go the join route, either by trimming the lat-lngs or geohashing them,
>> you’re essentially grouping nearby locations into buckets — but you have to
>> consider the borders of the buckets since the nearest location may actually
>> be in an adjacent bucket.  Here’s a paper that discusses an implementation:
>> http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf
>>
>> On Mar 9, 2015, at 11:42 PM, Akhil Das 
>> wrote:
>>
>> Are you using SparkSQL for the join? In that case I'm not quiet sure you
>> have a lot of options to join on the nearest co-ordinate. If you are using
>> the normal Spark code (by creating key-pair on lat,lon) you can apply
>> certain logic like trimming the lat,lon etc. If you want more specific
>> computing then you are better off using haversine formula.
>> 
>>
>>
>>
>
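
Since the haversine formula comes up above, a self-contained Scala sketch (assuming
coordinates in decimal degrees and the mean Earth radius in kilometres):

  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val earthRadiusKm = 6371.0
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) * math.pow(math.sin(dLon / 2), 2)
    2 * earthRadiusKm * math.asin(math.sqrt(a))
  }

  // e.g. distance between two nearby points, usable inside a map() when scoring candidates
  // haversineKm(37.7749, -122.4194, 37.8044, -122.2711)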


Unable to saveToCassandra while cassandraTable works fine

2015-03-11 Thread Tiwari, Tarun
Hi,

I am stuck at this for 3 days now. I am using the spark-cassandra-connector 
with spark and I am able to make RDDs with sc.cassandraTable function that 
means spark is able to communicate with Cassandra properly.

But somehow the saveToCassandra is not working. Below are the steps I am doing.
Does it have something to do with my spark-env or spark-defaults? Am I missing 
something critical ?

scala> import com.datastax.spark.connector._
scala> 
sc.addJar("/home/analytics/Installers/spark-1.1.1/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar")
scala> val myTable = sc.cassandraTable("test2", " words")
scala> myTable.collect()
--- this works perfectly fine.

scala> val data = sc.parallelize(Seq((81, "XXX"), (82, "")))
scala> data.saveToCassandra("test2", "words", SomeColumns("word", "count"))
--- this fails

15/03/11 15:16:45 INFO Cluster: New Cassandra host /10.131.141.192:9042 added
15/03/11 15:16:45 INFO LocalNodeFirstLoadBalancingPolicy: Added host 
10.131.141.192 (datacenter1)
15/03/11 15:16:45 INFO Cluster: New Cassandra host /10.131.141.193:9042 added
15/03/11 15:16:45 INFO LocalNodeFirstLoadBalancingPolicy: Added host 
10.131.141.193 (datacenter1)
15/03/11 15:16:45 INFO Cluster: New Cassandra host /10.131.141.200:9042 added
15/03/11 15:16:45 INFO CassandraConnector: Connected to Cassandra cluster: 
wfan_cluster_DB
15/03/11 15:16:45 INFO SparkContext: Starting job: runJob at 
RDDFunctions.scala:29
15/03/11 15:16:45 INFO DAGScheduler: Got job 1 (runJob at 
RDDFunctions.scala:29) with 2 output partitions (allowLocal=false)
15/03/11 15:16:45 INFO DAGScheduler: Final stage: Stage 1(runJob at 
RDDFunctions.scala:29)
15/03/11 15:16:45 INFO DAGScheduler: Parents of final stage: List()
15/03/11 15:16:45 INFO DAGScheduler: Missing parents: List()
15/03/11 15:16:45 INFO DAGScheduler: Submitting Stage 1 
(ParallelCollectionRDD[1] at parallelize at :20), which has no missing 
parents
15/03/11 15:16:45 INFO MemoryStore: ensureFreeSpace(7400) called with 
curMem=1792, maxMem=2778778828
15/03/11 15:16:45 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 7.2 KB, free 2.6 GB)
15/03/11 15:16:45 INFO MemoryStore: ensureFreeSpace(3602) called with 
curMem=9192, maxMem=2778778828
15/03/11 15:16:45 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in 
memory (estimated size 3.5 KB, free 2.6 GB)
15/03/11 15:16:45 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
10.131.141.200:56502 (size: 3.5 KB, free: 2.6 GB)
15/03/11 15:16:45 INFO BlockManagerMaster: Updated info of block 
broadcast_1_piece0
15/03/11 15:16:45 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 
(ParallelCollectionRDD[1] at parallelize at :20)
15/03/11 15:16:45 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/03/11 15:16:45 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 
10.131.141.192, PROCESS_LOCAL, 1216 bytes)
15/03/11 15:16:45 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 
10.131.141.193, PROCESS_LOCAL, 1217 bytes)
15/03/11 15:16:45 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
10.131.141.193:51660 (size: 3.5 KB, free: 267.3 MB)
15/03/11 15:16:45 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
10.131.141.192:32875 (size: 3.5 KB, free: 267.3 MB)
15/03/11 15:16:45 INFO CassandraConnector: Disconnected from Cassandra cluster: 
wfan_cluster_DB
15/03/11 15:16:46 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, 
10.131.141.192): java.lang.NoSuchMethodError: 
org.apache.spark.executor.TaskMetrics.outputMetrics()Lscala/Option;

com.datastax.spark.connector.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:70)

com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:119)

com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:29)

com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:29)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
15/03/11 15:16:46 INFO TaskSetManager: Starting task 0.1 in stage 1.0 (TID 4, 
10.131.141.192, PROCESS_LOCAL, 1216 bytes)
15/03/11 15:16:46 INFO ConnectionManager: Key not valid ? 
sun.nio.ch.SelectionKeyImpl@29ffe58e
15/03/11 15:16:46 INFO ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@29ffe58e
java.nio.channels.CancelledKeyException
at 
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:392)
at 
org.apache.spark.network.ConnectionManager
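
The paths in the log show Spark 1.1.1 running with a connector assembly built from the
1.2.0-SNAPSHOT branch, and a NoSuchMethodError on a Spark class (TaskMetrics.outputMetrics)
is usually a binary-compatibility mismatch between the two. A hedged sbt sketch pinning the
connector to the 1.1.x line (the exact version number is illustrative):

  libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"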

Re: Spark fpg large basket

2015-03-11 Thread Sean Owen
Have you looked at how big your output is? for example, if your min
support is very low, you will output a massive volume of frequent item
sets. If that's the case, then it may be expected that it's taking
ages to write terabytes of data.

On Wed, Mar 11, 2015 at 8:34 AM, Sean Barzilay  wrote:
> The program spends its time when I am writing the output to a text file and
> I am using 70 partitions
>
>
> On Wed, 11 Mar 2015 9:55 am Sean Owen  wrote:
>>
>> I don't think there is enough information here. Where is the program
>> spending its time? where does it "stop"? how many partitions are
>> there?
>>
>> On Wed, Mar 11, 2015 at 7:10 AM, Akhil Das 
>> wrote:
>> > You need to set spark.cores.max to a number say 16, so that on all 4
>> > machines the tasks will get distributed evenly, Another thing would be
>> > to
>> > set spark.default.parallelism if you haven't tried already.
>> >
>> > Thanks
>> > Best Regards
>> >
>> > On Wed, Mar 11, 2015 at 12:27 PM, Sean Barzilay 
>> > wrote:
>> >>
>> >> I am running on a 4 workers cluster each having between 16 to 30 cores
>> >> and
>> >> 50 GB of ram
>> >>
>> >>
>> >> On Wed, 11 Mar 2015 8:55 am Akhil Das 
>> >> wrote:
>> >>>
>> >>> Depending on your cluster setup (cores, memory), you need to specify
>> >>> the
>> >>> parallelism/repartition the data.
>> >>>
>> >>> Thanks
>> >>> Best Regards
>> >>>
>> >>> On Wed, Mar 11, 2015 at 12:18 PM, Sean Barzilay
>> >>> 
>> >>> wrote:
>> 
>>  Hi I am currently using spark 1.3.0-snapshot to run the fpg algorithm
>>  from the mllib library. When I am trying to run the algorithm over a
>>  large
>>  basket(over 1000 items) the program seems to never finish. Did anyone
>>  find a
>>  workaround for this problem?
>> >>>
>> >>>
>> >




Running Spark from Scala source files other than main file

2015-03-11 Thread Aung Kyaw Htet
Hi Everyone,

I am developing a Scala app in which the main object does not create the
SparkContext; another object defined in the same package creates it,
runs Spark operations and closes it. The jar file is built successfully in
Maven, but when I call spark-submit with this jar, Spark does not
seem to execute any code.

So my code looks like

[Main.scala]

object Main {
  def main(args: Array[String]) {
    /* check parameters */
    Component1.start(parameters)
  }
}

[Component1.scala]

object Component1 {
  def start(parameters: Array[String]) {
    val sc = new SparkContext(conf)
    /* do spark operations */
    sc.stop()
  }
}

The above code compiles into Main.jar but spark-submit does not execute
anything and does not show me any error (not in the logs either.)

spark-submit master= spark:// Main.jar

I've got this all the code working before when I wrote a single scala file,
but now that I am separating into multiple scala source files, something
isn't running right.

Any advice on this would be greatly appreciated!

Regards,
Aung


hbase sql query

2015-03-11 Thread Udbhav Agarwal
Hi,
How can we simply cache an HBase table and run SQL queries on it via the Java API in Spark?



Thanks,
Udbhav Agarwal



"Timed out while stopping the job generator" plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi,

it seems like I am unable to shut down my StreamingContext properly, both
in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster
mode, subsequent use of a new StreamingContext will raise
an InvalidActorNameException.

I use

  logger.info("stoppingStreamingContext")
  staticStreamingContext.stop(stopSparkContext=false,
stopGracefully=true)
  logger.debug("done")

and have in my output logs

  19:16:47.708 [ForkJoinPool-2-worker-11] INFO  stopping StreamingContext
[... output from other threads ...]
  19:17:07.729 [ForkJoinPool-2-worker-11] WARN  scheduler.JobGenerator -
Timed out while stopping the job generator (timeout = 2)
  19:17:07.739 [ForkJoinPool-2-worker-11] DEBUG done

The processing itself is complete, i.e., the batch currently processed at
the time of stop() is finished and no further batches are processed.
However, something keeps the streaming context from stopping properly. In
local[n] mode, this is not actually a problem (other than I have to wait 20
seconds for shutdown), but in yarn-cluster mode, I get an error

  akka.actor.InvalidActorNameException: actor name [JobGenerator] is not
unique!

when I start a (newly created) StreamingContext, and I was wondering what
* is the issue with stop()
* is the difference between local[n] and yarn-cluster mode.

Some possible reasons:
* On my executors, I use a networking library that depends on netty and
doesn't properly shut down the event loop. (That has not been a problem in
the past, though.)
* I have a non-empty state (from using updateStateByKey()) that is
checkpointed to /tmp/spark (in local mode) and hdfs:///tmp/spark (in
yarn-cluster) mode, could that be an issue? (In fact, I have not seen this
error in any non-stateful stream applications before.)

Any help much appreciated!

Thanks
Tobias


Re: Example of partitionBy in pyspark

2015-03-11 Thread Ted Yu
Should the comma after 1 be colon ?



> On Mar 11, 2015, at 1:41 AM, Stephen Boesch  wrote:
> 
> 
> I am finding that partitionBy is hanging - and it is not clear whether the 
> custom partitioner is even being invoked (i put an exception in there and can 
> not see it in the worker logs).
> 
> The structure is similar to the following:
> 
> inputPairedRdd = sc.parallelize([{0:"Entry1",1,"Entry2"}])
> 
> def identityPartitioner(key):
># just use the id as the partition number
># I am uncertain how to code this
> 
> partedRdd = inputPairedRdd.partitionBy(newNumPartitions, identityPartitioner)
> 
> 




skewed outer join with spark 1.2.0 - memory consumption

2015-03-11 Thread Marcin Cylke
Hi

I'm trying to do a join of two datasets: 800GB with ~50MB.

My code looks like this:

  private def parseClickEventLine(line: String, jsonFormatBC: Broadcast[LazyJsonFormat]): ClickEvent = {
    val json = line.parseJson.asJsObject
    val eventJson =
      if (json.fields.contains("recommendationId")) json
      else json.fields("message").asJsObject

    jsonFormatBC.value.clickEventJsonFormat.read(eventJson)
  }

  val jsonFormatBc: Broadcast[LazyJsonFormat] = sc.broadcast(new LazyJsonFormat)

  val views = sc.recoLogRdd(jobConfig.viewsDirectory)
    .map(view => (view.id.toString, view))

  val clicks = sc.textFile(s"${jobConfig.clicksDirectory}/*")
    .map(parseClickEventLine(_, jsonFormatBc))
    .map(click => (click.recommendationId, click))

  val clicksCounts = views.leftOuterJoin(clicks).map({ case (recommendationId, (view, click)) =>
    val metaItemType = click.flatMap(c => view.itemDetailsById.get(c.itemIdStr).map(_.metaItemType))
    (view, metaItemType) -> click.map(_ => 1).getOrElse(0)
  })

  clicksCounts.reduceByKey(_ + _).map(toCSV).saveAsTextFile(jobConfig.outputDirectory)

I'm using Spark 1.2.0 and have the following options set:

spark.default.parallelism = 24
spark.serializer': 'org.apache.spark.serializer.KryoSerializer',
spark.test.disableBlockManagerHeartBeat': 'true',
spark.shuffle.netty.connect.timeout': '3',
spark.storage.blockManagerSlaveTimeoutMs': '3',
spark.yarn.user.classpath.first': 'true',
spark.yarn.executor.memoryOverhead': '1536'

The job is run on YARN and I see errors in container logs:

015-03-11 09:16:56,629 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=24500,containerID=container_1425476483191_402083_01_19] is 
running beyond physical memory limits. Current usage: 6.0 GB of 6 GB physical 
memory used; 6.9 GB of 12.6 GB virtual memory used. Killing container.

So the problems is related to the excessive use of memory.

Could you advise me on what I should fix in my code to make it work for my use case?
The strange thing is that the code worked earlier, with versions around 1.0.0.
Is it possible that changes between 1.0.0 and 1.2.0 caused that kind of regression?

Regards 
Marcin
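
Given the 800GB-vs-~50MB shape, one hedged alternative is to broadcast the small clicks side
and do a map-side join instead of leftOuterJoin, so the 800GB views side is never shuffled.
This is only a sketch reusing the variable names above, and it assumes the grouped clicks fit
comfortably in memory on the driver and on each executor:

  val clicksByRec = sc.broadcast(clicks.groupByKey().mapValues(_.toSeq).collectAsMap())

  val clicksCounts = views.flatMap { case (recommendationId, view) =>
    clicksByRec.value.get(recommendationId) match {
      case Some(cs) =>
        cs.map { click =>
          val metaItemType = view.itemDetailsById.get(click.itemIdStr).map(_.metaItemType)
          (view, metaItemType) -> 1
        }
      case None =>
        Seq((view, None) -> 0)
    }
  }
  // the rest of the pipeline (reduceByKey(_ + _).map(toCSV).saveAsTextFile(...)) stays the same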




Example of partitionBy in pyspark

2015-03-11 Thread Stephen Boesch
I am finding that partitionBy is hanging - and it is not clear whether the
custom partitioner is even being invoked (i put an exception in there and
can not see it in the worker logs).

The structure is similar to the following:

inputPairedRdd = sc.parallelize([{0:"Entry1",1,"Entry2"}])

def identityPartitioner(key):
   # just use the id as the partition number
   # I am uncertain how to code this

partedRdd = inputPairedRdd.partitionBy(newNumPartitions,
identityPartitioner)


Re: Spark fpg large basket

2015-03-11 Thread Sean Barzilay
The program spends its time writing the output to a text file, and
I am using 70 partitions.

On Wed, 11 Mar 2015 9:55 am Sean Owen  wrote:

> I don't think there is enough information here. Where is the program
> spending its time? where does it "stop"? how many partitions are
> there?
>
> On Wed, Mar 11, 2015 at 7:10 AM, Akhil Das 
> wrote:
> > You need to set spark.cores.max to a number say 16, so that on all 4
> > machines the tasks will get distributed evenly, Another thing would be to
> > set spark.default.parallelism if you haven't tried already.
> >
> > Thanks
> > Best Regards
> >
> > On Wed, Mar 11, 2015 at 12:27 PM, Sean Barzilay 
> > wrote:
> >>
> >> I am running on a 4 workers cluster each having between 16 to 30 cores
> and
> >> 50 GB of ram
> >>
> >>
> >> On Wed, 11 Mar 2015 8:55 am Akhil Das 
> wrote:
> >>>
> >>> Depending on your cluster setup (cores, memory), you need to specify
> the
> >>> parallelism/repartition the data.
> >>>
> >>> Thanks
> >>> Best Regards
> >>>
> >>> On Wed, Mar 11, 2015 at 12:18 PM, Sean Barzilay <
> sesnbarzi...@gmail.com>
> >>> wrote:
> 
>  Hi I am currently using spark 1.3.0-snapshot to run the fpg algorithm
>  from the mllib library. When I am trying to run the algorithm over a
> large
>  basket(over 1000 items) the program seems to never finish. Did anyone
> find a
>  workaround for this problem?
> >>>
> >>>
> >
>


Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Thanks Sean. I'll ask our Hadoop admin.

Actually I didn't find hadoop.tmp.dir in the Hadoop settings...using user
home is suggested by other users.

Jianshi

On Wed, Mar 11, 2015 at 3:51 PM, Sean Owen  wrote:

> You shouldn't use /tmp, but it doesn't mean you should use user home
> directories either. Typically, like in YARN, you would a number of
> directories (on different disks) mounted and configured for local
> storage for jobs.
>
> On Wed, Mar 11, 2015 at 7:42 AM, Jianshi Huang 
> wrote:
> > Unfortunately /tmp mount is really small in our environment. I need to
> > provide a per-user setting as the default value.
> >
> > I hacked bin/spark-class for the similar effect. And spark-defaults.conf
> can
> > override it. :)
> >
> > Jianshi
> >
> > On Wed, Mar 11, 2015 at 3:28 PM, Patrick Wendell 
> wrote:
> >>
> >> We don't support expressions or wildcards in that configuration. For
> >> each application, the local directories need to be constant. If you
> >> have users submitting different Spark applications, those can each set
> >> spark.local.dirs.
> >>
> >> - Patrick
> >>
> >> On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang <
> jianshi.hu...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I need to set per-user spark.local.dir, how can I do that?
> >> >
> >> > I tried both
> >> >
> >> >   /x/home/${user.name}/spark/tmp
> >> > and
> >> >   /x/home/${USER}/spark/tmp
> >> >
> >> > And neither worked. Looks like it has to be a constant setting in
> >> > spark-defaults.conf. Right?
> >> >
> >> > Any ideas how to do that?
> >> >
> >> > Thanks,
> >> > --
> >> > Jianshi Huang
> >> >
> >> > LinkedIn: jianshi
> >> > Twitter: @jshuang
> >> > Github & Blog: http://huangjs.github.com/
> >
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Spark fpg large basket

2015-03-11 Thread Sean Owen
I don't think there is enough information here. Where is the program
spending its time? where does it "stop"? how many partitions are
there?

On Wed, Mar 11, 2015 at 7:10 AM, Akhil Das  wrote:
> You need to set spark.cores.max to a number say 16, so that on all 4
> machines the tasks will get distributed evenly, Another thing would be to
> set spark.default.parallelism if you haven't tried already.
>
> Thanks
> Best Regards
>
> On Wed, Mar 11, 2015 at 12:27 PM, Sean Barzilay 
> wrote:
>>
>> I am running on a 4 workers cluster each having between 16 to 30 cores and
>> 50 GB of ram
>>
>>
>> On Wed, 11 Mar 2015 8:55 am Akhil Das  wrote:
>>>
>>> Depending on your cluster setup (cores, memory), you need to specify the
>>> parallelism/repartition the data.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Wed, Mar 11, 2015 at 12:18 PM, Sean Barzilay 
>>> wrote:

 Hi I am currently using spark 1.3.0-snapshot to run the fpg algorithm
 from the mllib library. When I am trying to run the algorithm over a large
 basket(over 1000 items) the program seems to never finish. Did anyone find 
 a
 workaround for this problem?
>>>
>>>
>




Re: How to set per-user spark.local.dir?

2015-03-11 Thread Sean Owen
You shouldn't use /tmp, but that doesn't mean you should use user home
directories either. Typically, as in YARN, you would have a number of
directories (on different disks) mounted and configured for local
storage for jobs.

On Wed, Mar 11, 2015 at 7:42 AM, Jianshi Huang  wrote:
> Unfortunately /tmp mount is really small in our environment. I need to
> provide a per-user setting as the default value.
>
> I hacked bin/spark-class for the similar effect. And spark-defaults.conf can
> override it. :)
>
> Jianshi
>
> On Wed, Mar 11, 2015 at 3:28 PM, Patrick Wendell  wrote:
>>
>> We don't support expressions or wildcards in that configuration. For
>> each application, the local directories need to be constant. If you
>> have users submitting different Spark applications, those can each set
>> spark.local.dirs.
>>
>> - Patrick
>>
>> On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang 
>> wrote:
>> > Hi,
>> >
>> > I need to set per-user spark.local.dir, how can I do that?
>> >
>> > I tried both
>> >
>> >   /x/home/${user.name}/spark/tmp
>> > and
>> >   /x/home/${USER}/spark/tmp
>> >
>> > And neither worked. Looks like it has to be a constant setting in
>> > spark-defaults.conf. Right?
>> >
>> > Any ideas how to do that?
>> >
>> > Thanks,
>> > --
>> > Jianshi Huang
>> >
>> > LinkedIn: jianshi
>> > Twitter: @jshuang
>> > Github & Blog: http://huangjs.github.com/
>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/




Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Unfortunately /tmp mount is really small in our environment. I need to
provide a per-user setting as the default value.

I hacked bin/spark-class for the similar effect. And spark-defaults.conf
can override it. :)

Jianshi

On Wed, Mar 11, 2015 at 3:28 PM, Patrick Wendell  wrote:

> We don't support expressions or wildcards in that configuration. For
> each application, the local directories need to be constant. If you
> have users submitting different Spark applications, those can each set
> spark.local.dirs.
>
> - Patrick
>
> On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang 
> wrote:
> > Hi,
> >
> > I need to set per-user spark.local.dir, how can I do that?
> >
> > I tried both
> >
> >   /x/home/${user.name}/spark/tmp
> > and
> >   /x/home/${USER}/spark/tmp
> >
> > And neither worked. Looks like it has to be a constant setting in
> > spark-defaults.conf. Right?
> >
> > Any ideas how to do that?
> >
> > Thanks,
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: How to set per-user spark.local.dir?

2015-03-11 Thread Patrick Wendell
We don't support expressions or wildcards in that configuration. For
each application, the local directories need to be constant. If you
have users submitting different Spark applications, those can each set
spark.local.dirs.

- Patrick

On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang  wrote:
> Hi,
>
> I need to set per-user spark.local.dir, how can I do that?
>
> I tried both
>
>   /x/home/${user.name}/spark/tmp
> and
>   /x/home/${USER}/spark/tmp
>
> And neither worked. Looks like it has to be a constant setting in
> spark-defaults.conf. Right?
>
> Any ideas how to do that?
>
> Thanks,
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
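
A short sketch of the per-application route, expanding the user name in code rather than in
spark-defaults.conf (which does not support expressions or wildcards); the path layout is
the one from the original question:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("per-user-local-dir")
    .set("spark.local.dir", s"/x/home/${sys.props("user.name")}/spark/tmp")
  val sc = new SparkContext(conf)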




Re: Temp directory used by spark-submit

2015-03-11 Thread Akhil Das
After setting SPARK_LOCAL_DIRS/SPARK_WORKER_DIR you need to restart your
Spark instances (stop-all.sh and start-all.sh). You can also try setting
java.io.tmpdir while creating the SparkContext.

Thanks
Best Regards

On Wed, Mar 11, 2015 at 1:47 AM, Justin Yip  wrote:

> Hello,
>
> I notice that when I run spark-submit, a temporary directory containing
> all the jars and resource files is created under /tmp (for example,
> /tmp/spark-fd1b77fc-50f4-4b1c-a122-5cf36969407c).
>
> Sometimes this directory gets cleanup after the job, but sometimes it
> doesn't, which fills up my root directory.
>
> Is there a way to specify this directory so that it doesn't point to /tmp?
> I tried SPARK_LOCAL_DIRS, but it doesn't help, neither does
> SPARK_WORKER_DIR.
>
> Thanks.
>
> Justin
>
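
A hedged sketch of the java.io.tmpdir suggestion; /data/spark-tmp is an illustrative path,
and this only affects the JVM that creates the SparkContext (the driver side):

  import org.apache.spark.{SparkConf, SparkContext}

  System.setProperty("java.io.tmpdir", "/data/spark-tmp")
  val sc = new SparkContext(new SparkConf().setAppName("tmpdir-example"))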


Re: S3 SubFolder Write Issues

2015-03-11 Thread Akhil Das
Does it write anything in BUCKET/SUB_FOLDER/output?

Thanks
Best Regards

On Wed, Mar 11, 2015 at 10:15 AM, cpalm3  wrote:

> Hi All,
>
> I am hoping someone has seen this issue before with S3, as I haven't been
> able to find a solution for this problem.
>
> When I try to save as Text file to s3 into a subfolder, it only ever writes
> out to the bucket level folder
> and produces block level generated file names and not my output folder as I
> specified.
> Below is the sample code in Scala, I have also seen this behavior in the
> Java code.
>
>  val out =  inputRdd.map {ir => mapFunction(ir)}.groupByKey().mapValues { x
> => mapValuesFunction(x) }
>.saveAsTextFile("s3://BUCKET/SUB_FOLDER/output"
>
> Any ideas on how to get saveAsTextFile to write to an S3 subfolder?
>
> Thanks,
> Chris
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/S3-SubFolder-Write-Issues-tp21997.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Hi,

I need to set per-user spark.local.dir, how can I do that?

I tried both

  /x/home/${user.name}/spark/tmp
and
  /x/home/${USER}/spark/tmp

And neither worked. Looks like it has to be a constant setting in
spark-defaults.conf. Right?

Any ideas how to do that?

Thanks,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Spark fpg large basket

2015-03-11 Thread Akhil Das
You need to set spark.cores.max to a number, say 16, so that the tasks get
distributed evenly across all 4 machines. Another thing would be to
set spark.default.parallelism if you haven't tried that already.

Thanks
Best Regards

On Wed, Mar 11, 2015 at 12:27 PM, Sean Barzilay 
wrote:

> I am running on a 4 workers cluster each having between 16 to 30 cores and
> 50 GB of ram
>
> On Wed, 11 Mar 2015 8:55 am Akhil Das  wrote:
>
>> Depending on your cluster setup (cores, memory), you need to specify the
>> parallelism/repartition the data.
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Mar 11, 2015 at 12:18 PM, Sean Barzilay 
>> wrote:
>>
>>> Hi I am currently using spark 1.3.0-snapshot to run the fpg algorithm
>>> from the mllib library. When I am trying to run the algorithm over a large
>>> basket(over 1000 items) the program seems to never finish. Did anyone find
>>> a workaround for this problem?
>>>
>>
>>
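
A short sketch of the two settings Akhil mentions, assuming a standalone cluster; the numbers
are illustrative and should be sized to the 4 x 16-30 core setup described above:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("fpgrowth-job")
    .set("spark.cores.max", "64")            // spread tasks over all 4 workers
    .set("spark.default.parallelism", "128") // more partitions for the shuffle and the write
  val sc = new SparkContext(conf)

  // since the time is spent in saveAsTextFile, repartitioning just before the write can also
  // help; "results" is an illustrative name for the filtered item sets
  // results.repartition(128).saveAsTextFile("hdfs:///out/frequent-itemsets")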


Re: SocketTextStream not working from messages sent from other host

2015-03-11 Thread Akhil Das
Maybe you can use this code for your purpose:
https://gist.github.com/akhld/4286df9ab0677a555087 It basically sends the
content of the given file through a socket (both IO and NIO); I used it for a
benchmark between IO and NIO.

Thanks
Best Regards

On Wed, Mar 11, 2015 at 11:36 AM, Cui Lin  wrote:

>   Dear all,
>
>  I tried the socketTextStream to receive message from a port, similar to
> WordCount example, using
> socketTextStream(host, ip.toInt, StorageLevel.MEMORY_AND_DISK_SER)…
>
>  The problem is that I can only receive messages typed from “nc –lk [port
> no.] from my localhost, while the messages sent from other host to that
> port were ignored.
>
>  I can see all messages from “nc –l [port no.]", why such thing could
> happen?
>
>
>  Best regards,
>
>  Cui Lin
>


Re: S3 SubFolder Write Issues

2015-03-11 Thread Calum Leslie
You want the s3n:// ("native") protocol rather than s3://. s3:// is a block
filesystem based on S3 that doesn't respect paths.

More information on the Hadoop site: https://wiki.apache.org/hadoop/AmazonS3

Calum.

On Wed, 11 Mar 2015 04:47 cpalm3  wrote:

> Hi All,
>
> I am hoping someone has seen this issue before with S3, as I haven't been
> able to find a solution for this problem.
>
> When I try to save as Text file to s3 into a subfolder, it only ever writes
> out to the bucket level folder
> and produces block level generated file names and not my output folder as I
> specified.
> Below is the sample code in Scala, I have also seen this behavior in the
> Java code.
>
>  val out =  inputRdd.map {ir => mapFunction(ir)}.groupByKey().mapValues {
> x
> => mapValuesFunction(x) }
>.saveAsTextFile("s3://BUCKET/SUB_FOLDER/output"
>
> Any ideas on how to get saveAsTextFile to write to an S3 subfolder?
>
> Thanks,
> Chris
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/S3-SubFolder-Write-Issues-tp21997.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
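
A minimal sketch of Calum's s3n:// suggestion, with placeholder credentials and "results"
standing in for the RDD built above: with the s3n:// (native) filesystem the SUB_FOLDER part
is honoured as a real path prefix.

  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")       // placeholder
  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")   // placeholder
  results.saveAsTextFile("s3n://BUCKET/SUB_FOLDER/output")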