Spark Maven Build

2015-08-07 Thread Benyi Wang
I'm trying to build spark 1.4.1 against CDH 5.3.2. I created a profile
called cdh5.3.2 in spark_parent.pom, made some changes for
sql/hive/v0.13.1, and the build finished successfully.

Here is my problem:

   - If I run `mvn -Pcdh5.3.2,yarn,hive install`, the artifacts are
   installed into my local repo.
   - I expected the `hadoop-client` version to be
   `hadoop-client-2.5.0-cdh5.3.2`, but it is actually `hadoop-client-2.2.0`.

If I add a dependency on `spark-sql-1.2.0-cdh5.3.2`, the version is
`hadoop-client-2.5.0-cdh5.3.2`.

What's the trick behind it?


using Spark or pig group by efficient in my use case?

2015-08-07 Thread linlma
I have tens of millions of records, each a customer ID and city ID pair.
There are tens of millions of unique customer IDs, and only a few hundred
unique city IDs. I want to get all city IDs aggregated for each customer ID
and pull back all records. I plan to do this with a group by customer ID
using Pig on Hadoop, and I am wondering if that is the most efficient way.

I am also wondering whether there is overhead for sorting in Hadoop (I do not
care whether customer1 comes before customer2, as long as all cities are
aggregated correctly for customer1 and customer2). Do you think Spark is better?

Here is an example of inputs,

CustomerID1 City1
CustomerID2 City2
CustomerID3 City1
CustomerID1 City3
CustomerID2 City4
I want output like this,

CustomerID1 City1 City3
CustomerID2 City2 City4
CustomerID3 City1
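
For illustration, a minimal Spark (Scala) sketch of this aggregation, assuming
whitespace-delimited input text; the paths and names are assumptions, not part
of the original question:

```
val pairs = sc.textFile("hdfs:///data/customer_city.txt")   // assumed input location
  .map(_.split("\\s+"))
  .map(fields => (fields(0), fields(1)))                     // (customerID, cityID)

// aggregateByKey builds a per-partition set of cities first, so only the
// (customer, set-of-cities) pairs are shuffled, not every raw record, and no
// global sort of customer IDs is required.
val citiesPerCustomer = pairs.aggregateByKey(Set.empty[String])(
  (cities, city) => cities + city,   // add a city within a partition
  (a, b) => a ++ b)                  // merge sets across partitions

citiesPerCustomer
  .map { case (customer, cities) => (customer +: cities.toSeq).mkString(" ") }
  .saveAsTextFile("hdfs:///output/cities_per_customer")
```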

thanks in advance,
Lin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-Spark-or-pig-group-by-efficient-in-my-use-case-tp24178.html



Re: [Spark Streaming] Session based windowing like in google dataflow

2015-08-07 Thread Tathagata Das
You can use Spark Streaming's updateStateByKey to do arbitrary
sessionization.
See the example -
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
All it does is store a single number (the count of each word seen since the
beginning), but you can extend it to store arbitrary state data. And then you
can use that state to keep track of gaps, windows, etc.
People have done a lot of sessionization using this. I am sure others can
chime in.
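
For illustration, a rough Scala sketch of gap-based sessionization on top of
updateStateByKey. The event format, the 30-minute gap, and all names are
assumptions, not from this thread, and the StreamingContext must have a
checkpoint directory set:

```
// Assumed input: a DStream of (userId, eventTimeMs) pairs called `events`.
case class Session(start: Long, lastSeen: Long, count: Int)

val gapMs = 30 * 60 * 1000L   // assumed gap duration: 30 minutes

def updateSession(newTimes: Seq[Long], state: Option[Session]): Option[Session] = {
  val latest = if (newTimes.isEmpty) None else Some(newTimes.max)
  (state, latest) match {
    case (None, Some(t)) => Some(Session(t, t, newTimes.size))   // first event: open a session
    case (Some(s), None) => Some(s)                              // no new events: keep state (drop it once lastSeen is older than gapMs)
    case (Some(s), Some(t)) if t - s.lastSeen > gapMs =>
      Some(Session(t, t, newTimes.size))                         // gap exceeded: start a new session
    case (Some(s), Some(t)) =>
      Some(s.copy(lastSeen = math.max(s.lastSeen, t), count = s.count + newTimes.size))
    case (None, None) => None
  }
}

// updateStateByKey requires ssc.checkpoint(...) to be set on the StreamingContext.
val sessions = events.updateStateByKey[Session](updateSession _)
```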

On Fri, Aug 7, 2015 at 10:48 AM, Ankur Chauhan  wrote:

> Hi all,
>
> I am trying to figure out how to perform equivalent of "Session windows"
> (as mentioned in https://cloud.google.com/dataflow/model/windowing) using
> spark streaming. Is it even possible (i.e. possible to do efficiently at
> scale). Just to expand on the definition:
>
> Taken from the google dataflow documentation:
>
> The simplest kind of session windowing specifies a minimum gap duration.
> All data arriving below a minimum threshold of time delay is grouped into
> the same window. If data arrives after the minimum specified gap duration
> time, this initiates the start of a new window.
>
>
>
>
> Any help would be appreciated.
>
> -- Ankur Chauhan
>


Re: Checkpoint Dir Error in Yarn

2015-08-07 Thread Tathagata Das
Have you tried doing what it's suggesting?
If you want to learn more about checkpointing, you can see the programming
guide -
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
For more in depth understanding, you can see my talk -
https://www.youtube.com/watch?v=d5UJonrruHk
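
For reference, a minimal Scala sketch of what the assertion asks for: set a
checkpoint directory on the StreamingContext before starting the job. The HDFS
path and batch interval here are illustrative:

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // assumed path; any fault-tolerant storage works

val lines = ssc.socketTextStream("localhost", 9999)  // the nc source from the example
// ... stateful or windowed operations that require checkpointing go here ...

ssc.start()
ssc.awaitTermination()
```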

On Fri, Aug 7, 2015 at 5:48 PM, Mohit Anchlia 
wrote:

> I am running in yarn-client mode and trying to execute network word count
> example. When I connect through nc I see the following in spark app logs:
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: The
> checkpoint directory has not been set. Please use
> StreamingContext.checkpoint() or SparkContext.checkpoint() to set the
> checkpoint directory.
> at scala.Predef$.assert(Predef.scala:179)
> at
> org.apache.spark.streaming.dstream.DStream.validate(DStream.scala:183)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$validate$10.apply(DStream.scala:229)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$validate$10.apply(DStream.scala:229)
> at scala.collection
>


Re: Spark MLib v/s SparkR

2015-08-07 Thread gen tang
Hi,

It depends on the problem you are working on.
Just as with Python and R, MLlib focuses on machine learning, while SparkR will
focus on statistics, if SparkR follows the way of R.

For instance, if you want to use glm to analyse data:
1. If you are interested only in the parameters of the model, and in using that
model to predict, then you should use MLlib (a minimal sketch follows below).
2. If your focus is on the confidence of the model, for example the confidence
interval of the result and the significance level of the parameters, you should
choose SparkR. However, as there is no glm package for this purpose yet, you
need to code it yourself.
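
For point 1, a minimal MLlib (Scala) sketch of fitting a GLM-style model (here
logistic regression) to get coefficients and predictions; the data path is
Spark's bundled sample file and everything else is illustrative:

```
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)

// Point estimates and predictions are available ...
println(s"weights = ${model.weights}, intercept = ${model.intercept}")
val scored = data.map(p => (model.predict(p.features), p.label))

// ... but MLlib does not report standard errors, p-values or confidence
// intervals; that is the gap point 2 above refers to.
```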

Hope it can be helpful

Cheers
Gen


On Thu, Aug 6, 2015 at 2:24 AM, praveen S  wrote:

> I was wondering when one should go for MLib or SparkR. What is the
> criteria or what should be considered before choosing either of the
> solutions for data analysis?
> or What is the advantages of Spark MLib over Spark R or advantages of
> SparkR over MLib?
>


Re: Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread gen tang
Hi,

In fact, PySpark uses
org.apache.spark.examples.pythonconverters (./examples/src/main/scala/org/apache/spark/pythonconverters/)
to transform HBase Result objects into Python strings.
Spark updated these two scripts recently. However, the changes are not included
in the official release of Spark, so you are trying to use the new Python
script with an old jar.

You can clone the newest Spark code from GitHub and build the examples jar.
Then you will get the correct result.

Cheers
Gen


On Sat, Aug 8, 2015 at 5:03 AM, Eric Bless 
wrote:

> I’m having some difficulty getting the desired results from the Spark
> Python example hbase_inputformat.py. I’m running with CDH5.4, hbase Version
> 1.0.0, Spark v 1.3.0 Using Python version 2.6.6
>
> I followed the example to create a test HBase table. Here’s the data from
> the table I created –
> hbase(main):001:0> scan 'dev_wx_test'
> ROW   COLUMN+CELL
> row1 column=f1:a, timestamp=1438716994027, value=value1
> row1 column=f1:b, timestamp=1438717004248, value=value2
> row2 column=f1:, timestamp=1438717014529, value=value3
> row3 column=f1:, timestamp=1438717022756, value=value4
> 3 row(s) in 0.2620 seconds
>
> When either of these statements are included -
> “hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n"))”  or
> “hbase_rdd = hbase_rdd.flatMapValues(lambda v:
> v.split("\n")).countByValue().items()” the result is -
> We only get the following printed; (row1, value2) is not printed:
> ((u'row1', u'value1'), 1) ((u'row2', u'value3'), 1)
> ((u'row3', u'value4'), 1)
>  This looks like similar results to the following post I found -
>
> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-td18613.html#a18650
> but it appears the pythonconverter HBaseResultToStringConverter has been
> updated since then.
>
And this problem will be resolved too.


>
> When the statement
> “hbase_rdd = hbase_rdd.flatMapValues(lambda v:
> v.split("\n")).mapValues(json.loads)” is included, the result is –
> ValueError: No JSON object could be decoded
>
>
> **
> Here is more info on this from the log –
> Traceback (most recent call last):
>   File "hbase_inputformat.py", line 87, in 
> output = hbase_rdd.collect()
>   File
> "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py",
> line 701, in collect
>   File
> "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/java_gateway.py",
> line 538, in __call__
>   File
> "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o44.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 1.0 (TID 4, stluhdpddev27.monsanto.com):
> org.apache.spark.api.python.PythonException: Traceback (most recent call
> last):
>   File
> "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py",
> line 101, in main
> process()
>   File
> "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py",
> line 96, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File
> "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/serializers.py",
> line 236, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File
> "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py",
> line 1807, in 
>   File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
> return _default_decoder.decode(s)
>   File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
> obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>   File "/usr/lib64/python2.6/json/decoder.py", line 338, in raw_decode
> raise ValueError("No JSON object could be decoded")
> ValueError: No JSON object could be decoded
>
> at
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
> at
> org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:176)
> at
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Tas

Checkpoint Dir Error in Yarn

2015-08-07 Thread Mohit Anchlia
I am running in yarn-client mode and trying to execute network word count
example. When I connect through nc I see the following in spark app logs:

Exception in thread "main" java.lang.AssertionError: assertion failed: The
checkpoint directory has not been set. Please use
StreamingContext.checkpoint() or SparkContext.checkpoint() to set the
checkpoint directory.
at scala.Predef$.assert(Predef.scala:179)
at
org.apache.spark.streaming.dstream.DStream.validate(DStream.scala:183)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$validate$10.apply(DStream.scala:229)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$validate$10.apply(DStream.scala:229)
at scala.collection


Re: How to get total CPU consumption for Spark job

2015-08-07 Thread gen tang
Hi,

The Spark UI and logs don't show cluster-level resource usage. However, you can
use Ganglia to monitor the cluster. In spark-ec2, there is an option to install
Ganglia automatically.

If you use CDH, you can also use Cloudera Manager.

Cheers
Gen


On Sat, Aug 8, 2015 at 6:06 AM, Xiao JIANG  wrote:

> Hi all,
>
>
> I was running some Hive/spark job on hadoop cluster.  I want to see how
> spark helps improve not only the elapsed time but also the total CPU
> consumption.
>
>
> For Hive, I can get the 'Total MapReduce CPU Time Spent' from the log when
> the job finishes. But I didn't find any CPU stats for Spark jobs from
> either spark log or web UI. Is there any place I can find the total CPU
> consumption for my spark job? Thanks!
>
>
> Here is the version info: Spark version 1.3.0 Using Scala version 2.10.4,
> Java 1.7.0_67
>
>
> Thanks!
>
> Xiao
>


Re: Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Yes! I was being dumb, should have caught that earlier, thank you Cheng Lian

On Fri, Aug 7, 2015 at 4:25 PM, Cheng Lian  wrote:

> It doesn't seem to be Parquet 1.7.0 since the package name isn't under
> "org.apache.parquet" (1.7.0 is the first official Apache release of
> Parquet). The version you were using is probably Parquet 1.6.0rc3 according
> to the line number information:
> https://github.com/apache/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java#L249
> And you're hitting PARQUET-136, which has been fixed in (the real) Parquet
> 1.7.0 https://issues.apache.org/jira/browse/PARQUET-136
>
> Cheng
>
>
> On 8/8/15 6:20 AM, Jerrick Hoang wrote:
>
> Hi all,
>
> I have a partitioned parquet table (very small table with only 2
> partitions). The version of spark is 1.4.1, parquet version is 1.7.0. I
> applied this patch to spark [SPARK-7743] so I assume that spark can read
> parquet files normally, however, I'm getting this when trying to do a
> simple `select count(*) from table`,
>
> ```org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 29 in stage 44.0 failed 15 times, most recent failure: Lost task 29.14 in
> stage 44.0: java.lang.NullPointerException
> at
> parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
> at
> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
> at
> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
> at
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
> at
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)```
>
> Has anybody seen this before?
>
> Thanks
>
>
>


Re: Spark failed while trying to read parquet files

2015-08-07 Thread Cheng Lian
It doesn't seem to be Parquet 1.7.0 since the package name isn't under 
"org.apache.parquet" (1.7.0 is the first official Apache release of 
Parquet). The version you were using is probably Parquet 1.6.0rc3 
according to the line number information: 
https://github.com/apache/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java#L249 
And you're hitting PARQUET-136, which has been fixed in (the real) 
Parquet 1.7.0 https://issues.apache.org/jira/browse/PARQUET-136


Cheng

On 8/8/15 6:20 AM, Jerrick Hoang wrote:

Hi all,

I have a partitioned parquet table (very small table with only 2 
partitions). The version of spark is 1.4.1, parquet version is 1.7.0. 
I applied this patch to spark [SPARK-7743] so I assume that spark can 
read parquet files normally, however, I'm getting this when trying to 
do a simple `select count(*) from table`,


```org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 29 in stage 44.0 failed 15 times, most recent failure: Lost task 
29.14 in stage 44.0: java.lang.NullPointerException
at 
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
at 
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
at 
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)

at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
at 
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)```

Has anybody seen this before?

Thanks




does dstream.transform() run on the driver node?

2015-08-07 Thread lookfwd
Hello, here's a simple program that demonstrates my problem:



Is "keyavg = rdd.values().reduce(sum) / rdd.count()" inside stats calculated
one time per partition or it's just once? I guess another way to ask the
same question is DStream.transform() is called on the driver node or not?

What would be an alternative way to do this two step computation without
calculating the average many times? I guess I could do it in a foreachRDD()
block but it doesn't seem appropriate given that this is more of a a
transform than an action.
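
For illustration (this is not the original, elided program), a minimal Scala
sketch of the same two-step computation, assuming a DStream of (key, Double)
pairs:

```
// Assumed input: a DStream of (key, Double) pairs called `pairs`.
val normalized = pairs.transform { rdd =>
  // This block runs on the driver once per batch; reduce() and count() launch
  // distributed jobs, but the resulting average is a single driver-side value.
  val avg = rdd.values.reduce(_ + _) / rdd.count()   // guard against empty batches in real code
  rdd.mapValues(_ - avg)                             // the closure only captures the scalar average
}
```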



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/does-dstream-transform-run-on-the-driver-node-tp24176.html



Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Hien,
Is Azkaban being phased out at LinkedIn as rumored? If so, what's LinkedIn
going to use for workflow scheduling? Is there something else that's going
to replace Azkaban?

On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu  wrote:

> In my opinion, choosing some particular project among its peers should
> leave enough room for future growth (which may come faster than you
> initially think).
>
> Cheers
>
> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu  wrote:
>
>> Scalability is a known issue due to the current architecture. However,
>> this will only be applicable if you run more than 20K jobs per day.
>>
>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu  wrote:
>>
>>> From what I heard from an ex-coworker (an Oozie committer), Azkaban is
>>> being phased out at LinkedIn because of scalability issues (though UI-wise,
>>> Azkaban seems better).
>>>
>>> Vikram:
>>> I suggest you do more research in related projects (maybe using their
>>> mailing lists).
>>>
>>> Disclaimer: I don't work for LinkedIn.
>>>
>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 Hi Vikram,

 We use Azkaban (2.5.0) in our production workflow scheduling. We just
 use local mode deployment and it is fairly easy to set up. It is pretty
 easy to use and has a nice scheduling and logging interface, as well as
 SLAs (like kill job and notify if it doesn't complete in 3 hours or
 whatever).

 However Spark support is not present directly - we run everything with
 shell scripts and spark-submit. There is a plugin interface where one could
 create a Spark plugin, but I found it very cumbersome when I did
 investigate and didn't have the time to work through it to develop that.

 It has some quirks and while there is actually a REST API for adding
 jobs and dynamically scheduling jobs, it is not documented anywhere so you
 kinda have to figure it out for yourself. But in terms of ease of use I
 found it way better than Oozie. I haven't tried Chronos, and it seemed
 quite involved to set up. Haven't tried Luigi either.

 Spark job server is good but as you say lacks some stuff like
 scheduling and DAG type workflows (independent of spark-defined job flows).


 On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke 
 wrote:

> Check also falcon in combination with oozie
>
> On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:
>
>> Looks like Oozie can satisfy most of your requirements.
>>
>>
>>
>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone 
>> wrote:
>>
>>> Hi,
>>> I'm looking for open source workflow tools/engines that allow us to
>>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>>> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I
>>> wanted to check with people here to see what they are using today.
>>>
>>> Some of the requirements of the workflow engine that I'm looking for
>>> are
>>>
>>> 1. First class support for submitting Spark jobs on Cassandra. Not
>>> some wrapper Java code to submit tasks.
>>> 2. Active open source community support and well tested at
>>> production scale.
>>> 3. Should be dead easy to write job dependencies using XML or a web
>>> interface. E.g., job A depends on Job B and Job C, so run Job A after B and
>>> C are finished. Don't need to write full blown java applications to specify
>>> job parameters and dependencies. Should be very simple to use.
>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given
>>> time every hour or day or week or month.
>>> 5. Job monitoring, alerting on failures and email notifications on
>>> daily basis.
>>>
>>> I have looked at Ooyala's spark job server, which seems to be geared
>>> towards making spark jobs run faster by sharing contexts between the jobs
>>> but isn't a full blown workflow engine per se. A combination of spark job
>>> server and workflow engine would be ideal.
>>>
>>> Thanks for the inputs
>>>
>>
>>

>>>
>>
>


Re: Spark failed while trying to read parquet files

2015-08-07 Thread Philip Weaver
Yes, NullPointerExceptions are pretty common in Spark (or, rather, I seem
to encounter them a lot!) but can occur for a few different reasons. Could
you add some more detail, like what the schema is for the data, or the code
you're using to read it?

On Fri, Aug 7, 2015 at 3:20 PM, Jerrick Hoang 
wrote:

> Hi all,
>
> I have a partitioned parquet table (very small table with only 2
> partitions). The version of spark is 1.4.1, parquet version is 1.7.0. I
> applied this patch to spark [SPARK-7743] so I assume that spark can read
> parquet files normally, however, I'm getting this when trying to do a
> simple `select count(*) from table`,
>
> ```org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 29 in stage 44.0 failed 15 times, most recent failure: Lost task 29.14 in
> stage 44.0: java.lang.NullPointerException
> at
> parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
> at
> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
> at
> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
> at
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
> at
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)```
>
> Has anybody seen this before?
>
> Thanks
>


Re: spark config

2015-08-07 Thread Dean Wampler
That's the correct URL. Recent change? The last time I looked, earlier this
week, it still had the obsolete artifactory URL for URL1 ;)

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Fri, Aug 7, 2015 at 5:19 PM, Ted Yu  wrote:

> In master branch, build/sbt-launch-lib.bash has the following:
>
>   URL1=
> https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
>
> I verified that the following exists:
>
>
> https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.7/#sbt-launch.jar
>
> FYI
>
> On Fri, Aug 7, 2015 at 2:08 PM, Bryce Lobdell  wrote:
>
>>
>> I Recently downloaded spark package 1.4.0:
>>
>> A build of Spark with "sbt/sbt clean assembly" failed with message
>> "Error: Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar"
>>
>> Upon investigation I figured out that "sbt-launch-0.13.7.jar" is
>> downloaded at build time and that it contained the following:
>>
>> 
>> 404 Not Found
>> 
>> 404 Not Found
>> nginx
>> 
>> 
>>
>> which is an HTML error message to the effect that the file is missing
>> (from the web server).
>>
>>
>> The script sbt-launch-lib.bash contains the following lines which
>> determine where the file sbt-launch.jar is downloaded from:
>>
>> acquire_sbt_jar () {
>>   SBT_VERSION=`awk -F "=" '/sbt\.version/ {print $2}'
>> ./project/build.properties`
>>   URL1=
>> http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
>>   URL2=
>> http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
>>   JAR=build/sbt-launch-${SBT_VERSION}.jar
>>
>>
>> The script sbt-launch.bash downloads $URL1 first, and incorrectly
>> concludes that it succeeded on the basis that the file sbt-launch-0.13.7.jar
>> exists (though it contains HTML).
>>
>> I succeeded in building Spark by:
>>
>> (1)  Downloading the file sbt-launch-0.13.7.jar from $URL2 and placing
>> it in the build directory.
>> (2)  Modifying sbt-launch-lib.bash to prevent the download of that file.
>> (3)  Restarting the download as I usually would, with "SPARK_HIVE=true
>> SPARK_HADOOP_VERSION=2.5.1 sbt/sbt clean assembly"
>>
>>
>> I think a lot of people will be confused by this.  Probably someone
>> should do some of the following:
>>
>> (1)  Delete $URL1 and all references, or replace it with the
>> correct/current URL which points to the sbt-launch.jar(s).
>> (2)  Modify sbt-launch-lib.bash, so that it will not conclude that the
>> download of sbt-launch.jar succeeded, when the data returned is an HTML
>> error message.
>>
>>
>> Let me know if this is not clear, I will gladly explain in more detail or
>> with more clarity, if needed.
>>
>> -Bryce Lobdell
>>
>>
>> A transcript of my console is below:
>>
>>
>>
>>
>> @ip-xx-xxx-xx-xxx:~/spark/spark-1.4.0$ SPARK_HIVE=true
>> SPARK_HADOOP_VERSION=2.5.1 sbt/sbt clean assembly
>> NOTE: The sbt/sbt script has been relocated to build/sbt.
>>   Please update references to point to the new location.
>>
>>   Invoking 'build/sbt clean assembly' now ...
>>
>> Using /usr/lib/jvm/java-7-openjdk-amd64/ as default JAVA_HOME.
>> Note, this will be overridden by -java-home if it is set.
>> Attempting to fetch sbt
>> Launching sbt from build/sbt-launch-0.13.7.jar
>> *Error: Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar*
>> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0$ cd build/
>> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls
>> mvn  sbt  sbt-launch-0.13.7.jar  sbt-launch-lib.bash
>> *inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ unzip -l
>> sbt-launch-0.13.7.jar*
>> *Archive:  sbt-launch-0.13.7.jar*
>> *  End-of-central-directory signature not found.  Either this file is not*
>> *  a zipfile, or it constitutes one disk of a multi-part archive.  In the*
>> *  latter case the central directory and zipfile comment will be found on*
>> *  the last disk(s) of this archive.*
>> unzip:  cannot find zipfile directory in one of sbt-launch-0.13.7.jar or
>> sbt-launch-0.13.7.jar.zip, and cannot find
>> sbt-launch-0.13.7.jar.ZIP, period.
>> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls
>> mvn  sbt  sbt-launch-0.13.7.jar  sbt-launch-lib.bash
>> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls -l
>> total 28
>> -rwxr-xr-x 1 inquidia inquidia 5384 Jun  3 01:07 mvn
>> -rwxr-xr-x 1 inquidia inquidia 5395 Jun  3 01:07 sbt
>> -rw-rw-r-- 1 inquidia inquidia  162 Aug  7 20:24 sbt-launch-0.13.7.jar
>> -rwxr-xr-x 1 inquidia inquidia 5285 Jun  3 01:07 sbt-launch-lib.bash
>> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls -l
>> total 28
>> -rwxr-xr-x 1 inquidia inquidia 5384 Jun  3 01:07 mvn
>> -rwxr-xr-x 1 inquidia inquidia 5395 Jun  3 01:07 sbt
>> -rw-rw-r-- 1 inquidia inquidia  *162 *Aug  7 20:24 sbt-l

Re: spark config

2015-08-07 Thread Ted Yu
Looks like Sean fixed it:

[SPARK-9633] [BUILD] SBT download locations outdated; need an update

Cheers

On Fri, Aug 7, 2015 at 3:22 PM, Dean Wampler  wrote:

> That's the correct URL. Recent change? The last time I looked, earlier
> this week, it still had the obsolete artifactory URL for URL1 ;)
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Fri, Aug 7, 2015 at 5:19 PM, Ted Yu  wrote:
>
>> In master branch, build/sbt-launch-lib.bash has the following:
>>
>>   URL1=
>> https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
>>
>> I verified that the following exists:
>>
>>
>> https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.7/#sbt-launch.jar
>>
>> FYI
>>
>> On Fri, Aug 7, 2015 at 2:08 PM, Bryce Lobdell  wrote:
>>
>>>
>>> I Recently downloaded spark package 1.4.0:
>>>
>>> A build of Spark with "sbt/sbt clean assembly" failed with message
>>> "Error: Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar"
>>>
>>> Upon investigation I figured out that "sbt-launch-0.13.7.jar" is
>>> downloaded at build time and that it contained the following:
>>>
>>> 
>>> 404 Not Found
>>> 
>>> 404 Not Found
>>> nginx
>>> 
>>> 
>>>
>>> which is an HTML error message to the effect that the file is missing
>>> (from the web server).
>>>
>>>
>>> The script sbt-launch-lib.bash contains the following lines which
>>> determine where the file sbt-launch.jar is downloaded from:
>>>
>>> acquire_sbt_jar () {
>>>   SBT_VERSION=`awk -F "=" '/sbt\.version/ {print $2}'
>>> ./project/build.properties`
>>>   URL1=
>>> http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
>>>   URL2=
>>> http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
>>>   JAR=build/sbt-launch-${SBT_VERSION}.jar
>>>
>>>
>>> The script sbt-launch.bash downloads $URL1 first, and incorrectly
>>> concludes that it succeeded on the basis that the file sbt-launch-0.13.7.jar
>>> exists (though it contains HTML).
>>>
>>> I succeeded in building Spark by:
>>>
>>> (1)  Downloading the file sbt-launch-0.13.7.jar from $URL2 and placing
>>> it in the build directory.
>>> (2)  Modifying sbt-launch-lib.bash to prevent the download of that file.
>>> (3)  Restarting the download as I usually would, with "SPARK_HIVE=true
>>> SPARK_HADOOP_VERSION=2.5.1 sbt/sbt clean assembly"
>>>
>>>
>>> I think a lot of people will be confused by this.  Probably someone
>>> should do some of the following:
>>>
>>> (1)  Delete $URL1 and all references, or replace it with the
>>> correct/current URL which points to the sbt-launch.jar(s).
>>> (2)  Modify sbt-launch-lib.bash, so that it will not conclude that the
>>> download of sbt-launch.jar succeeded, when the data returned is an HTML
>>> error message.
>>>
>>>
>>> Let me know if this is not clear, I will gladly explain in more detail
>>> or with more clarity, if needed.
>>>
>>> -Bryce Lobdell
>>>
>>>
>>> A transcript of my console is below:
>>>
>>>
>>>
>>>
>>> @ip-xx-xxx-xx-xxx:~/spark/spark-1.4.0$ SPARK_HIVE=true
>>> SPARK_HADOOP_VERSION=2.5.1 sbt/sbt clean assembly
>>> NOTE: The sbt/sbt script has been relocated to build/sbt.
>>>   Please update references to point to the new location.
>>>
>>>   Invoking 'build/sbt clean assembly' now ...
>>>
>>> Using /usr/lib/jvm/java-7-openjdk-amd64/ as default JAVA_HOME.
>>> Note, this will be overridden by -java-home if it is set.
>>> Attempting to fetch sbt
>>> Launching sbt from build/sbt-launch-0.13.7.jar
>>> *Error: Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar*
>>> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0$ cd build/
>>> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls
>>> mvn  sbt  sbt-launch-0.13.7.jar  sbt-launch-lib.bash
>>> *inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ unzip -l
>>> sbt-launch-0.13.7.jar*
>>> *Archive:  sbt-launch-0.13.7.jar*
>>> *  End-of-central-directory signature not found.  Either this file is
>>> not*
>>> *  a zipfile, or it constitutes one disk of a multi-part archive.  In
>>> the*
>>> *  latter case the central directory and zipfile comment will be found
>>> on*
>>> *  the last disk(s) of this archive.*
>>> unzip:  cannot find zipfile directory in one of sbt-launch-0.13.7.jar or
>>> sbt-launch-0.13.7.jar.zip, and cannot find
>>> sbt-launch-0.13.7.jar.ZIP, period.
>>> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls
>>> mvn  sbt  sbt-launch-0.13.7.jar  sbt-launch-lib.bash
>>> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls -l
>>> total 28
>>> -rwxr-xr-x 1 inquidia inquidia 5384 Jun  3 01:07 mvn
>>> -rwxr-xr-x 1 inquidia inquidia 5395 Jun  3 01:07 sbt
>>> -rw-rw-r-- 1 inquidia inquidia  162 Aug  7 20:24 sbt-launch-0.13.7.jar
>>> -rwxr-x

Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Hi all,

I have a partitioned parquet table (very small table with only 2
partitions). The version of spark is 1.4.1, parquet version is 1.7.0. I
applied this patch to spark [SPARK-7743] so I assume that spark can read
parquet files normally, however, I'm getting this when trying to do a
simple `select count(*) from table`,

```org.apache.spark.SparkException: Job aborted due to stage failure: Task
29 in stage 44.0 failed 15 times, most recent failure: Lost task 29.14 in
stage 44.0: java.lang.NullPointerException
at
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
at
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
at
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
at
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
at
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)```

Has anybody seen this before?

Thanks


Re: spark config

2015-08-07 Thread Ted Yu
In master branch, build/sbt-launch-lib.bash has the following:

  URL1=
https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar

I verified that the following exists:

https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.7/#sbt-launch.jar

FYI

On Fri, Aug 7, 2015 at 2:08 PM, Bryce Lobdell  wrote:

>
> I Recently downloaded spark package 1.4.0:
>
> A build of Spark with "sbt/sbt clean assembly" failed with message "Error:
> Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar"
>
> Upon investigation I figured out that "sbt-launch-0.13.7.jar" is
> downloaded at build time and that it contained the following:
>
> 
> 404 Not Found
> 
> 404 Not Found
> nginx
> 
> 
>
> which is an HTML error message to the effect that the file is missing
> (from the web server).
>
>
> The script sbt-launch-lib.bash contains the following lines which
> determine where the file sbt-launch.jar is downloaded from:
>
> acquire_sbt_jar () {
>   SBT_VERSION=`awk -F "=" '/sbt\.version/ {print $2}'
> ./project/build.properties`
>   URL1=
> http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
>   URL2=
> http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
>   JAR=build/sbt-launch-${SBT_VERSION}.jar
>
>
> The script sbt-launch.bash downloads $URL1 first, and incorrectly
> concludes that it succeeded on the basis that the file sbt-launch-0.13.7.jar
> exists (though it contains HTML).
>
> I succeeded in building Spark by:
>
> (1)  Downloading the file sbt-launch-0.13.7.jar from $URL2 and placing it
> in the build directory.
> (2)  Modifying sbt-launch-lib.bash to prevent the download of that file.
> (3)  Restarting the download as I usually would, with "SPARK_HIVE=true
> SPARK_HADOOP_VERSION=2.5.1 sbt/sbt clean assembly"
>
>
> I think a lot of people will be confused by this.  Probably someone should
> do some of the following:
>
> (1)  Delete $URL1 and all references, or replace it with the
> correct/current URL which points to the sbt-launch.jar(s).
> (2)  Modify sbt-launch-lib.bash, so that it will not conclude that the
> download of sbt-launch.jar succeeded, when the data returned is an HTML
> error message.
>
>
> Let me know if this is not clear, I will gladly explain in more detail or
> with more clarity, if needed.
>
> -Bryce Lobdell
>
>
> A transcript of my console is below:
>
>
>
>
> @ip-xx-xxx-xx-xxx:~/spark/spark-1.4.0$ SPARK_HIVE=true
> SPARK_HADOOP_VERSION=2.5.1 sbt/sbt clean assembly
> NOTE: The sbt/sbt script has been relocated to build/sbt.
>   Please update references to point to the new location.
>
>   Invoking 'build/sbt clean assembly' now ...
>
> Using /usr/lib/jvm/java-7-openjdk-amd64/ as default JAVA_HOME.
> Note, this will be overridden by -java-home if it is set.
> Attempting to fetch sbt
> Launching sbt from build/sbt-launch-0.13.7.jar
> *Error: Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar*
> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0$ cd build/
> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls
> mvn  sbt  sbt-launch-0.13.7.jar  sbt-launch-lib.bash
> *inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ unzip -l
> sbt-launch-0.13.7.jar*
> *Archive:  sbt-launch-0.13.7.jar*
> *  End-of-central-directory signature not found.  Either this file is not*
> *  a zipfile, or it constitutes one disk of a multi-part archive.  In the*
> *  latter case the central directory and zipfile comment will be found on*
> *  the last disk(s) of this archive.*
> unzip:  cannot find zipfile directory in one of sbt-launch-0.13.7.jar or
> sbt-launch-0.13.7.jar.zip, and cannot find
> sbt-launch-0.13.7.jar.ZIP, period.
> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls
> mvn  sbt  sbt-launch-0.13.7.jar  sbt-launch-lib.bash
> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls -l
> total 28
> -rwxr-xr-x 1 inquidia inquidia 5384 Jun  3 01:07 mvn
> -rwxr-xr-x 1 inquidia inquidia 5395 Jun  3 01:07 sbt
> -rw-rw-r-- 1 inquidia inquidia  162 Aug  7 20:24 sbt-launch-0.13.7.jar
> -rwxr-xr-x 1 inquidia inquidia 5285 Jun  3 01:07 sbt-launch-lib.bash
> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls -l
> total 28
> -rwxr-xr-x 1 inquidia inquidia 5384 Jun  3 01:07 mvn
> -rwxr-xr-x 1 inquidia inquidia 5395 Jun  3 01:07 sbt
> -rw-rw-r-- 1 inquidia inquidia  *162 *Aug  7 20:24 sbt-launch-0.13.7.jar
> -rwxr-xr-x 1 inquidia inquidia 5285 Jun  3 01:07 sbt-launch-lib.bash
> inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ cat
> sbt-launch-0.13.7.jar
> **
> *404 Not Found*
> **
> *404 Not Found*
> *nginx*
> **
> **
>
>
>
>


Re: tachyon

2015-08-07 Thread Abhishek R. Singh
Thanks Calvin - much appreciated !

-Abhishek-

On Aug 7, 2015, at 11:11 AM, Calvin Jia  wrote:

> Hi Abhishek,
> 
> Here's a production use case that may interest you: 
> http://www.meetup.com/Tachyon/events/222485713/
> 
> Baidu is using Tachyon to manage more than 100 nodes in production resulting 
> in a 30x performance improvement for their SparkSQL workload. They are also 
> using the tiered storage feature in Tachyon giving them over 2PB of Tachyon 
> managed space.
> 
> Hope this helps,
> Calvin
> 
> On Fri, Aug 7, 2015 at 10:00 AM, Ted Yu  wrote:
> Looks like you would get better response on Tachyon's mailing list:
> 
> https://groups.google.com/forum/?fromgroups#!forum/tachyon-users
> 
> Cheers
> 
> On Fri, Aug 7, 2015 at 9:56 AM, Abhishek R. Singh 
>  wrote:
> Do people use Tachyon in production, or is it experimental grade still?
> 
> Regards,
> Abhishek
> 
> 
> 
> 



How to get total CPU consumption for Spark job

2015-08-07 Thread Xiao JIANG
Hi all,
I was running some Hive/Spark jobs on a Hadoop cluster. I want to see how Spark
helps improve not only the elapsed time but also the total CPU consumption.

For Hive, I can get the 'Total MapReduce CPU Time Spent' from the log when the
job finishes. But I didn't find any CPU stats for Spark jobs in either the Spark
log or the web UI. Is there any place I can find the total CPU consumption for
my Spark job? Thanks!

Here is the version info: Spark version 1.3.0, Scala version 2.10.4, Java
1.7.0_67

Thanks!
Xiao

Accessing S3 files with s3n://

2015-08-07 Thread Akshat Aranya
Hi,

I've been trying to track down some problems with Spark reads being very
slow with s3n:// URIs (NativeS3FileSystem).  After some digging around, I
realized that this file system implementation fetches the entire file,
which isn't really a Spark problem, but it really slows down things when
trying to just read headers from a Parquet file or just creating partitions
in the RDD.  Is this something that others have observed before, or am I
doing something wrong?

Thanks,
Akshat


Fwd: spark config

2015-08-07 Thread Bryce Lobdell
I Recently downloaded spark package 1.4.0:

A build of Spark with "sbt/sbt clean assembly" failed with message "Error:
Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar"

Upon investigation I figured out that "sbt-launch-0.13.7.jar" is downloaded
at build time and that it contained the following:


404 Not Found

404 Not Found
nginx



which is an HTML error message to the effect that the file is missing (from
the web server).


The script sbt-launch-lib.bash contains the following lines which determine
where the file sbt-launch.jar is downloaded from:

acquire_sbt_jar () {
  SBT_VERSION=`awk -F "=" '/sbt\.version/ {print $2}'
./project/build.properties`
  URL1=
http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
  URL2=
http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
  JAR=build/sbt-launch-${SBT_VERSION}.jar


The script sbt-launch.bash downloads $URL1 first, and incorrectly concludes
that it succeeded on the basis that the file sbt-launch-0.13.7.jar exists
(though it contains HTML).

I succeeded in building Spark by:

(1)  Downloading the file sbt-launch-0.13.7.jar from $URL2 and placing it
in the build directory.
(2)  Modifying sbt-launch-lib.bash to prevent the download of that file.
(3)  Restarting the download as I usually would, with "SPARK_HIVE=true
SPARK_HADOOP_VERSION=2.5.1 sbt/sbt clean assembly"


I think a lot of people will be confused by this.  Probably someone should
do some of the following:

(1)  Delete $URL1 and all references, or replace it with the
correct/current URL which points to the sbt-launch.jar(s).
(2)  Modify sbt-launch-lib.bash, so that it will not conclude that the
download of sbt-launch.jar succeeded, when the data returned is an HTML
error message.


Let me know if this is not clear, I will gladly explain in more detail or
with more clarity, if needed.

-Bryce Lobdell


A transcript of my console is below:




@ip-xx-xxx-xx-xxx:~/spark/spark-1.4.0$ SPARK_HIVE=true
SPARK_HADOOP_VERSION=2.5.1 sbt/sbt clean assembly
NOTE: The sbt/sbt script has been relocated to build/sbt.
  Please update references to point to the new location.

  Invoking 'build/sbt clean assembly' now ...

Using /usr/lib/jvm/java-7-openjdk-amd64/ as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
Attempting to fetch sbt
Launching sbt from build/sbt-launch-0.13.7.jar
*Error: Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar*
inquidia@ip-10-102-69-107:~/spark/spark-1.4.0$ cd build/
inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls
mvn  sbt  sbt-launch-0.13.7.jar  sbt-launch-lib.bash
*inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ unzip -l
sbt-launch-0.13.7.jar*
*Archive:  sbt-launch-0.13.7.jar*
*  End-of-central-directory signature not found.  Either this file is not*
*  a zipfile, or it constitutes one disk of a multi-part archive.  In the*
*  latter case the central directory and zipfile comment will be found on*
*  the last disk(s) of this archive.*
unzip:  cannot find zipfile directory in one of sbt-launch-0.13.7.jar or
sbt-launch-0.13.7.jar.zip, and cannot find
sbt-launch-0.13.7.jar.ZIP, period.
inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls
mvn  sbt  sbt-launch-0.13.7.jar  sbt-launch-lib.bash
inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls -l
total 28
-rwxr-xr-x 1 inquidia inquidia 5384 Jun  3 01:07 mvn
-rwxr-xr-x 1 inquidia inquidia 5395 Jun  3 01:07 sbt
-rw-rw-r-- 1 inquidia inquidia  162 Aug  7 20:24 sbt-launch-0.13.7.jar
-rwxr-xr-x 1 inquidia inquidia 5285 Jun  3 01:07 sbt-launch-lib.bash
inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ ls -l
total 28
-rwxr-xr-x 1 inquidia inquidia 5384 Jun  3 01:07 mvn
-rwxr-xr-x 1 inquidia inquidia 5395 Jun  3 01:07 sbt
-rw-rw-r-- 1 inquidia inquidia  *162 *Aug  7 20:24 sbt-launch-0.13.7.jar
-rwxr-xr-x 1 inquidia inquidia 5285 Jun  3 01:07 sbt-launch-lib.bash
inquidia@ip-10-102-69-107:~/spark/spark-1.4.0/build$ cat
sbt-launch-0.13.7.jar
**
*404 Not Found*
**
*404 Not Found*
*nginx*
**
**


Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread Eric Bless
I’m having some difficulty getting the desired results from the Spark Python
example hbase_inputformat.py. I’m running with CDH5.4, hbase Version 1.0.0,
Spark v 1.3.0, using Python version 2.6.6.

I followed the example to create a test HBase table. Here’s the data from the
table I created –
hbase(main):001:0> scan 'dev_wx_test'
ROW   COLUMN+CELL
row1 column=f1:a, timestamp=1438716994027, value=value1
row1 column=f1:b, timestamp=1438717004248, value=value2
row2 column=f1:, timestamp=1438717014529, value=value3
row3 column=f1:, timestamp=1438717022756, value=value4
3 row(s) in 0.2620 seconds

When either of these statements is included -
“hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n"))”  or
“hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n")).countByValue().items()”
the result is -
We only get the following printed; (row1, value2) is not printed:
((u'row1', u'value1'), 1) ((u'row2', u'value3'), 1)
((u'row3', u'value4'), 1)

This looks like similar results to the following post I found -
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-td18613.html#a18650
but it appears the pythonconverter HBaseResultToStringConverter has been
updated since then.

When the statement
“hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n")).mapValues(json.loads)”
is included, the result is –
ValueError: No JSON object could be decoded
**
 Here is more info on this from the log –
Traceback (most recent call last):
  File "hbase_inputformat.py", line 87, in 
    output = hbase_rdd.collect()
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py", line 701, in collect
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/java_gateway.py", line 538, in __call__
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID
4, stluhdpddev27.monsanto.com): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py", line 101, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py", line 1807, in 
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)

Re: SparkSQL: "add jar" blocks all queries

2015-08-07 Thread Wu, James C.
Hi,

The issue only seems to happen when trying to access spark via the SparkSQL 
Thrift Server interface.

Does anyone know a fix?

james

From: Wu, James C. (Walt Disney) <james.c...@disney.com>
Date: Friday, August 7, 2015 at 12:40 PM
To: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: SparkSQL: "add jar" blocks all queries

Hi,

I got into a situation where a prior "add jar" command caused Spark SQL to stop
working for all users.

Does anyone know how to fix the issue?

Regards,

james

From: Wu, James C. (Walt Disney) <james.c...@disney.com>
Date: Friday, August 7, 2015 at 10:29 AM
To: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: SparkSQL: remove jar added by "add jar " command from dependencies

Hi,

I am using Spark SQL to run some queries on a set of avro data. Somehow I am 
getting this error

0: jdbc:hive2://n7-z01-0a2a1453> select count(*) from flume_test;

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
3 in stage 26.0 failed 4 times, most recent failure: Lost task 3.3 in stage 
26.0 (TID 1027, n7-z01-0a2a1457.iaas.starwave.com): java.io.IOException: 
Incomplete HDFS URI, no host: hdfs:data/hive-jars/avro-mapred.jar

at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:141)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)

at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1364)

at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:498)

at org.apache.spark.util.Utils$.fetchFile(Utils.scala:383)

at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:350)

at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:347)

at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)

at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)

at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:347)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)


I did not add the jar in this session, so I am wondering how I can get the jar 
removed from the dependencies so that it is not blocking all my Spark SQL 
queries for all sessions.

Thanks,

James


Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Good point; I agree that defaulting to online SGD (single example per
iteration) would be a poor UX due to performance.

On Fri, Aug 7, 2015 at 12:44 PM, Meihua Wu 
wrote:

> Feynman, thanks for clarifying.
>
> If we default miniBatchFraction = (1 / numInstances), then we will
> only hit one row for every iteration of SGD regardless of the number of
> partitions and executors. In other words the parallelism provided by
> the RDD is lost in this approach. I think this is something we need to
> consider for the default value of miniBatchFraction.
>
> On Fri, Aug 7, 2015 at 11:24 AM, Feynman Liang 
> wrote:
> > Yep, I think that's what Gerald is saying and they are proposing to
> default
> > miniBatchFraction = (1 / numInstances). Is that correct?
> >
> > On Fri, Aug 7, 2015 at 11:16 AM, Meihua Wu  >
> > wrote:
> >>
> >> I think in the SGD algorithm, the mini batch sample is done without
> >> replacement. So with fraction=1, then all the rows will be sampled
> >> exactly once to form the miniBatch, resulting to the
> >> deterministic/classical case.
> >>
> >> On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang 
> >> wrote:
> >> > Sounds reasonable to me, feel free to create a JIRA (and PR if you're
> up
> >> > for
> >> > it) so we can see what others think!
> >> >
> >> > On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler
> >> >  wrote:
> >> >>
> >> >> hi,
> >> >>
> >> >> if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
> >> >> doesn’t that make it a deterministic/classical gradient descent
> rather
> >> >> than a SGD?
> >> >>
> >> >> Specifically, miniBatchFraction=1.0 means the entire data set, i.e.
> >> >> all rows. In the spirit of SGD, shouldn’t the default be the fraction
> >> >> that results in exactly one row of the data set?
> >> >>
> >> >> thank you
> >> >> gerald
> >> >>
> >> >> --
> >> >> Gerald Loeffler
> >> >> mailto:gerald.loeff...@googlemail.com
> >> >> http://www.gerald-loeffler.net
> >> >>
> >> >> -
> >> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> >> For additional commands, e-mail: user-h...@spark.apache.org
> >> >>
> >> >
> >
> >
>


Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread andy petrella
Exactly!

The sharing part is used in the Spark Notebook (this one
)
so we can share things between notebooks, which are different SparkContexts
(in different JVMs).

OTOH, we have a project that creates micro services on genomics data; for
several reasons we used Tachyon to serve genome cubes (ranges across
genomes), see here .

HTH
andy

On Fri, Aug 7, 2015 at 8:36 PM Calvin Jia  wrote:

> Hi,
>
> Tachyon  manages memory off heap which can
> help prevent long GC pauses. Also, using Tachyon will allow the data to be
> shared between Spark jobs if they use the same dataset.
>
> Here's  a production use
> case where Baidu runs Tachyon to get 30x performance improvement in their
> SparkSQL workload.
>
> Hope this helps,
> Calvin
>
> On Fri, Aug 7, 2015 at 9:42 AM, Muler  wrote:
>
>> Spark is an in-memory engine and attempts to do computation in-memory.
>> Tachyon is memory-centric distributed storage, OK, but how would that help
>> run Spark faster?
>>
>
> --
andy


Re: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread François Pelletier

look at
spark.history.ui.port, if you use standalone
spark.yarn.historyServer.address, if you use YARN

in your Spark config file

Mine is located at
/etc/spark/conf/spark-defaults.conf

If you use Apache Ambari you can find these settings in the Spark /
Configs / Advanced spark-defaults tab
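
For reference, a minimal sketch of the event-log / history-server entries in
spark-defaults.conf (the paths and port below are placeholders, not values from
your cluster):

    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs:///user/spark/applicationHistory
    spark.history.fs.logDirectory    hdfs:///user/spark/applicationHistory
    spark.history.ui.port            18080

With event logging enabled, finished applications stay browsable in the History
Server UI after the driver exits.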

François

On 2015-08-07 15:58, saif.a.ell...@wellsfargo.com wrote:
>
> Hello, thank you, but that port is unreachable for me. Can you please
> share where I can find the equivalent port in my environment?
>
>  
>
> Thank you
>
> Saif
>
>  
>
> *From:*François Pelletier [mailto:newslett...@francoispelletier.org]
> *Sent:* Friday, August 07, 2015 4:38 PM
> *To:* user@spark.apache.org
> *Subject:* Re: Spark master driver UI: How to keep it after process
> finished?
>
>  
>
> Hi, all spark processes are saved in the Spark History Server
>
> look at your host on port 18080 instead of 4040
>
> François
>
> On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com
>  wrote:
>
> Hi,
>
>  
>
> A silly question here. The Driver Web UI dies when the
> spark-submit program finishes. I would like some time to analyze
> after the program ends, as the page does not refresh itself; when
> I hit F5 I lose all the info.
>
>  
>
> Thanks,
>
> Saif
>
>  
>
>  
>



Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Koen Vantomme


Sent from my Sony Xperia™ smartphone

 Meihua Wu wrote 

>Feynman, thanks for clarifying.
>
>If we default miniBatchFraction = (1 / numInstances), then we will
>only hit one row for every iteration of SGD regardless of the number of
>partitions and executors. In other words the parallelism provided by
>the RDD is lost in this approach. I think this is something we need to
>consider for the default value of miniBatchFraction.
>
>On Fri, Aug 7, 2015 at 11:24 AM, Feynman Liang  wrote:
>> Yep, I think that's what Gerald is saying and they are proposing to default
>> miniBatchFraction = (1 / numInstances). Is that correct?
>>
>> On Fri, Aug 7, 2015 at 11:16 AM, Meihua Wu 
>> wrote:
>>>
>>> I think in the SGD algorithm, the mini batch sample is done without
>>> replacement. So with fraction=1, then all the rows will be sampled
>>> exactly once to form the miniBatch, resulting to the
>>> deterministic/classical case.
>>>
>>> On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang 
>>> wrote:
>>> > Sounds reasonable to me, feel free to create a JIRA (and PR if you're up
>>> > for
>>> > it) so we can see what others think!
>>> >
>>> > On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler
>>> >  wrote:
>>> >>
>>> >> hi,
>>> >>
>>> >> if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
>>> >> doesn’t that make it a deterministic/classical gradient descent rather
>>> >> than a SGD?
>>> >>
>>> >> Specifically, miniBatchFraction=1.0 means the entire data set, i.e.
>>> >> all rows. In the spirit of SGD, shouldn’t the default be the fraction
>>> >> that results in exactly one row of the data set?
>>> >>
>>> >> thank you
>>> >> gerald
>>> >>
>>> >> --
>>> >> Gerald Loeffler
>>> >> mailto:gerald.loeff...@googlemail.com
>>> >> http://www.gerald-loeffler.net
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>> >
>>
>>
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


RE: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Koen Vantomme


Sent from my Sony Xperia™ smartphone

 saif.a.ell...@wellsfargo.com wrote 

> 
>
>Hello, thank you, but that port is unreachable for me. Can you please share 
>where I can find the equivalent port in my environment?
>
> 
>
>Thank you
>
>Saif
>
> 
>
>From: François Pelletier [mailto:newslett...@francoispelletier.org] 
>Sent: Friday, August 07, 2015 4:38 PM
>To: user@spark.apache.org
>Subject: Re: Spark master driver UI: How to keep it after process finished?
>
> 
>
>Hi, all spark processes are saved in the Spark History Server
>
>look at your host on port 18080 instead of 4040
>
>François
>
>On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote:
>
>Hi,
>
> 
>
>A silly question here. The Driver Web UI dies when the spark-submit program 
>finishes. I would like some time to analyze after the program ends, as the page 
>does not refresh itself; when I hit F5 I lose all the info.
>
> 
>
>Thanks,
>
>Saif
>
> 
>
> 
>


RE: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hello, thank you, but that port is unreachable for me. Can you please share 
where I can find the equivalent port in my environment?

Thank you
Saif

From: François Pelletier [mailto:newslett...@francoispelletier.org]
Sent: Friday, August 07, 2015 4:38 PM
To: user@spark.apache.org
Subject: Re: Spark master driver UI: How to keep it after process finished?

Hi, all spark processes are saved in the Spark History Server

look at your host on port 18080 instead of 4040

François
On 2015-08-07 15:26,
saif.a.ell...@wellsfargo.com wrote:
Hi,

A silly question here. The Driver Web UI dies when the spark-submit program 
finishes. I would like some time to analyze after the program ends, as the page 
does not refresh itself; when I hit F5 I lose all the info.

Thanks,
Saif




Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Meihua Wu
Feynman, thanks for clarifying.

If we default miniBatchFraction = (1 / numInstances), then we will
only hit one row for every iteration of SGD regardless of the number of
partitions and executors. In other words the parallelism provided by
the RDD is lost in this approach. I think this is something we need to
consider for the default value of miniBatchFraction.
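
For concreteness, a minimal sketch of how the fraction is passed explicitly
today (the data and hyperparameter values below are only illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// toy dataset; in practice this is a large RDD[LabeledPoint]
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(2.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(3.0, Vectors.dense(3.0, 1.5))
))

val numIterations = 100
val stepSize = 0.01
// 1.0 (the current default) means full-batch gradient descent;
// 1.0 / data.count() would sample roughly one row per iteration
val miniBatchFraction = 0.1
val model = LinearRegressionWithSGD.train(data, numIterations, stepSize, miniBatchFraction)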

On Fri, Aug 7, 2015 at 11:24 AM, Feynman Liang  wrote:
> Yep, I think that's what Gerald is saying and they are proposing to default
> miniBatchFraction = (1 / numInstances). Is that correct?
>
> On Fri, Aug 7, 2015 at 11:16 AM, Meihua Wu 
> wrote:
>>
>> I think in the SGD algorithm, the mini batch sample is done without
>> replacement. So with fraction=1, then all the rows will be sampled
>> exactly once to form the miniBatch, resulting to the
>> deterministic/classical case.
>>
>> On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang 
>> wrote:
>> > Sounds reasonable to me, feel free to create a JIRA (and PR if you're up
>> > for
>> > it) so we can see what others think!
>> >
>> > On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler
>> >  wrote:
>> >>
>> >> hi,
>> >>
>> >> if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
>> >> doesn’t that make it a deterministic/classical gradient descent rather
>> >> than a SGD?
>> >>
>> >> Specifically, miniBatchFraction=1.0 means the entire data set, i.e.
>> >> all rows. In the spirit of SGD, shouldn’t the default be the fraction
>> >> that results in exactly one row of the data set?
>> >>
>> >> thank you
>> >> gerald
>> >>
>> >> --
>> >> Gerald Loeffler
>> >> mailto:gerald.loeff...@googlemail.com
>> >> http://www.gerald-loeffler.net
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkSQL: "add jar" blocks all queries

2015-08-07 Thread Wu, James C.
Hi,

I got into a situation where a prior "add jar" command caused Spark SQL to stop 
working for all users.

Does anyone know how to fix the issue?

Regards,

james

From: , Walt Disney mailto:james.c...@disney.com>>
Date: Friday, August 7, 2015 at 10:29 AM
To: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: SparkSQL: remove jar added by "add jar " command from dependencies

Hi,

I am using Spark SQL to run some queries on a set of avro data. Somehow I am 
getting this error

0: jdbc:hive2://n7-z01-0a2a1453> select count(*) from flume_test;

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
3 in stage 26.0 failed 4 times, most recent failure: Lost task 3.3 in stage 
26.0 (TID 1027, n7-z01-0a2a1457.iaas.starwave.com): java.io.IOException: 
Incomplete HDFS URI, no host: hdfs:data/hive-jars/avro-mapred.jar

at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:141)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)

at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1364)

at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:498)

at org.apache.spark.util.Utils$.fetchFile(Utils.scala:383)

at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:350)

at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:347)

at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)

at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)

at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:347)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)


I did not add the jar in this session, so I am wondering how I can get the jar 
removed from the dependencies so that it is not blocking all my Spark SQL 
queries for all sessions.
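
For what it's worth, the path in the error looks like a scheme-only URI
("hdfs:data/hive-jars/avro-mapred.jar"). I assume a fully qualified one would
look like the following (the namenode host and port here are placeholders):

ADD JAR hdfs://namenode:8020/data/hive-jars/avro-mapred.jar;

That would not remove the bad entry from existing sessions, but it might help
whoever re-adds the jar to avoid the "Incomplete HDFS URI" error.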

Thanks,

James


Re: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread François Pelletier
Hi, all spark processes are saved in the Spark History Server

look at your host on port 18080 instead of 4040

François

On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote:
> Hi,
>  
> A silly question here. The Driver Web UI dies when the spark-submit
> program finishes. I would like some time to analyze after the program
> ends, as the page does not refresh itself; when I hit F5 I lose all
> the info.
>  
> Thanks,
> Saif
>  



Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hi,

A silly question here. The Driver Web UI dies when the spark-submit program 
finishes. I would like some time to analyze after the program ends, as the page 
does not refresh itself; when I hit F5 I lose all the info.

Thanks,
Saif



Re: Estimate size of Dataframe programatically

2015-08-07 Thread Ted Yu
Have you tried calling SizeEstimator.estimate() on a DataFrame ?

I did the following in REPL:

scala> SizeEstimator.estimate(df)
res1: Long = 17769680

FYI
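
For the sample-and-extrapolate approach you mention on the RDD side, a rough
sketch (the input path and sample size are placeholders, and the result is only
an approximation of the in-memory footprint):

scala> import org.apache.spark.util.SizeEstimator
scala> val rdd = sc.textFile("hdfs:///some/input")
scala> val sample = rdd.take(1000)
scala> val avgRowSize = sample.map(r => SizeEstimator.estimate(r)).sum.toDouble / sample.length
scala> val approxTotalBytes = avgRowSize * rdd.count()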

On Fri, Aug 7, 2015 at 6:48 AM, Srikanth  wrote:

> Hello,
>
> Is there a way to estimate the approximate size of a dataframe? I know we
> can cache and look at the size in UI but I'm trying to do this
> programmatically. With an RDD, I can sample and sum up size using
> SizeEstimator. Then extrapolate it to the entire RDD. That will give me
> approx size of the RDD. With dataframes, it's tricky due to columnar storage.
> How do we do it?
>
> On a related note, I see size of RDD object to be ~60MB. Is that the
> footprint of RDD in driver JVM?
>
> scala> val temp = sc.parallelize(Array(1,2,3,4,5,6))
> scala> SizeEstimator.estimate(temp)
> res13: Long = 69507320
>
> Srikanth
>


RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
Good to know that.
Let me research it and give it a try.
Thanks
Yong

From: mich...@databricks.com
Date: Fri, 7 Aug 2015 11:44:48 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org

You can register your data as a table using this library and then query it 
using HiveQL
CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "src/test/resources/episodes.avro")
On Fri, Aug 7, 2015 at 11:42 AM, java8964  wrote:



Hi, Michael:
I am not sure how spark-avro can help in this case. 
My understanding is that to use Spark-avro, I have to translate all the logic 
from this big Hive query into Spark code, right?
If I have this big Hive query, how can I use spark-avro to run it?
Thanks
Yong

From: mich...@databricks.com
Date: Fri, 7 Aug 2015 11:32:21 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org

Have you considered trying Spark SQL's native support for avro data?
https://github.com/databricks/spark-avro

On Fri, Aug 7, 2015 at 11:30 AM, java8964  wrote:



Hi, Spark users:
We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production 
cluster, which has 42 data/task nodes.
There is one dataset stored as Avro files, about 3T. Our business has a complex
query running against the dataset, which is stored in a nested structure with Array of
Struct in Avro and Hive.
We can query it using Hive without any problem, but we like SparkSQL's
performance, so we run the same query in Spark SQL, and found it is in fact
much faster than Hive.
But when we run it, we get the following error randomly from Spark executors,
sometimes seriously enough to fail the whole Spark job.
Below is the stack trace, and I think it is a bug related to Spark because:
1) The error shows up inconsistently, as sometimes we won't see it for this job.
(We run it daily.)
2) Sometimes it won't fail our job, as it recovers after a retry.
3) Sometimes it will fail our job, as listed below.
4) Is this due to the multithreading in Spark? The NullPointerException indicates
Hive got a null ObjectInspector among the children of a StructObjectInspector; as
I read the Hive source code, I know there is no null ObjectInspector among the
children of StructObjectInspector. Googling this error didn't give me any hint.
Does anyone know anything about this?
Project 
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
 StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL 
tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L, 
StringType),CAST(active_contact_count_a#16L, 
StringType),CAST(other_api_contact_count_a#6L, 
StringType),CAST(fb_api_contact_count_a#7L, 
StringType),CAST(evm_contact_count_a#8L, 
StringType),CAST(loyalty_contact_count_a#9L, 
StringType),CAST(mobile_jmml_contact_count_a#10L, 
StringType),CAST(savelocal_contact_count_a#11L, 
StringType),CAST(siteowner_contact_count_a#12L, 
StringType),CAST(socialcamp_service_contact_count_a#13L, 
S...org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage 1.0 
(TID 257, 10.20.95.146): java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.(AvroObjectInspectorGenerator.java:55)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compu

Re: Spark SQL query AVRO file

2015-08-07 Thread Michael Armbrust
You can register your data as a table using this library and then query it
using HiveQL

CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "src/test/resources/episodes.avro")


On Fri, Aug 7, 2015 at 11:42 AM, java8964  wrote:

> Hi, Michael:
>
> I am not sure how spark-avro can help in this case.
>
> My understanding is that to use Spark-avro, I have to translate all the
> logic from this big Hive query into Spark code, right?
>
> If I have this big Hive query, how can I use spark-avro to run it?
>
> Thanks
>
> Yong
>
> --
> From: mich...@databricks.com
> Date: Fri, 7 Aug 2015 11:32:21 -0700
> Subject: Re: Spark SQL query AVRO file
> To: java8...@hotmail.com
> CC: user@spark.apache.org
>
>
> Have you considered trying Spark SQL's native support for avro data?
>
> https://github.com/databricks/spark-avro
>
> On Fri, Aug 7, 2015 at 11:30 AM, java8964  wrote:
>
> Hi, Spark users:
>
> We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our
> production cluster, which has 42 data/task nodes.
>
> There is one dataset stored as Avro files, about 3T. Our business has a
> complex query running against the dataset, which is stored in a nested
> structure with Array of Struct in Avro and Hive.
>
> We can query it using Hive without any problem, but we like SparkSQL's
> performance, so we run the same query in Spark SQL, and found it is in fact
> much faster than Hive.
>
> But when we run it, we get the following error randomly from Spark
> executors, sometimes seriously enough to fail the whole Spark job.
>
> Below is the stack trace, and I think it is a bug related to Spark because:
>
> 1) The error shows up inconsistently, as sometimes we won't see it for this
> job. (We run it daily.)
> 2) Sometimes it won't fail our job, as it recovers after a retry.
> 3) Sometimes it will fail our job, as listed below.
> 4) Is this due to the multithreading in Spark? The NullPointerException
> indicates Hive got a null ObjectInspector among the children of a
> StructObjectInspector; as I read the Hive source code, I know there is no
> null ObjectInspector among the children of StructObjectInspector. Googling
> this error didn't give me any hint. Does anyone know anything about this?
>
> Project
> [HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
> StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL
> tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L,
> StringType),CAST(active_contact_count_a#16L,
> StringType),CAST(other_api_contact_count_a#6L,
> StringType),CAST(fb_api_contact_count_a#7L,
> StringType),CAST(evm_contact_count_a#8L,
> StringType),CAST(loyalty_contact_count_a#9L,
> StringType),CAST(mobile_jmml_contact_count_a#10L,
> StringType),CAST(savelocal_contact_count_a#11L,
> StringType),CAST(siteowner_contact_count_a#12L,
> StringType),CAST(socialcamp_service_contact_count_a#13L,
> S...org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 58 in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in
> stage 1.0 (TID 257, 10.20.95.146): java.lang.NullPointerException
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.(AvroObjectInspectorGenerator.java:55)
> at
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
> at
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
> at
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
> at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
> at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>

RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
Hi, Michael:
I am not sure how spark-avro can help in this case. 
My understanding is that to use Spark-avro, I have to translate all the logic 
from this big Hive query into Spark code, right?
If I have this big Hive query, how can I use spark-avro to run it?
Thanks
Yong

From: mich...@databricks.com
Date: Fri, 7 Aug 2015 11:32:21 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org

Have you considered trying Spark SQL's native support for avro data?
https://github.com/databricks/spark-avro

On Fri, Aug 7, 2015 at 11:30 AM, java8964  wrote:



Hi, Spark users:
We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production 
cluster, which has 42 data/task nodes.
There is one dataset stored as Avro files, about 3T. Our business has a complex
query running against the dataset, which is stored in a nested structure with Array of
Struct in Avro and Hive.
We can query it using Hive without any problem, but we like SparkSQL's
performance, so we run the same query in Spark SQL, and found it is in fact
much faster than Hive.
But when we run it, we get the following error randomly from Spark executors,
sometimes seriously enough to fail the whole Spark job.
Below is the stack trace, and I think it is a bug related to Spark because:
1) The error shows up inconsistently, as sometimes we won't see it for this job.
(We run it daily.)
2) Sometimes it won't fail our job, as it recovers after a retry.
3) Sometimes it will fail our job, as listed below.
4) Is this due to the multithreading in Spark? The NullPointerException indicates
Hive got a null ObjectInspector among the children of a StructObjectInspector; as
I read the Hive source code, I know there is no null ObjectInspector among the
children of StructObjectInspector. Googling this error didn't give me any hint.
Does anyone know anything about this?
Project 
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
 StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL 
tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L, 
StringType),CAST(active_contact_count_a#16L, 
StringType),CAST(other_api_contact_count_a#6L, 
StringType),CAST(fb_api_contact_count_a#7L, 
StringType),CAST(evm_contact_count_a#8L, 
StringType),CAST(loyalty_contact_count_a#9L, 
StringType),CAST(mobile_jmml_contact_count_a#10L, 
StringType),CAST(savelocal_contact_count_a#11L, 
StringType),CAST(siteowner_contact_count_a#12L, 
StringType),CAST(socialcamp_service_contact_count_a#13L, 
S...org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage 1.0 
(TID 257, 10.20.95.146): java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.(AvroObjectInspectorGenerator.java:55)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.

Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Calvin Jia
Hi,

Tachyon  manages memory off heap which can help
prevent long GC pauses. Also, using Tachyon will allow the data to be
shared between Spark jobs if they use the same dataset.

Here's  a production use
case where Baidu runs Tachyon to get 30x performance improvement in their
SparkSQL workload.
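
As a minimal sketch of the sharing idea (the Tachyon master host/port and paths
below are placeholders, and this assumes the Tachyon client jar and its Hadoop
filesystem binding are on Spark's classpath):

// job 1: materialize a dataset into Tachyon-managed storage
sc.textFile("hdfs:///some/input")
  .saveAsTextFile("tachyon://tachyon-master:19998/shared/dataset")

// job 2 (possibly a different SparkContext/JVM): read the same data back through Tachyon
val shared = sc.textFile("tachyon://tachyon-master:19998/shared/dataset")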

Hope this helps,
Calvin

On Fri, Aug 7, 2015 at 9:42 AM, Muler  wrote:

> Spark is an in-memory engine and attempts to do computation in-memory.
> Tachyon is memory-centric distributed storage, OK, but how would that help
> run Spark faster?
>


Re: Spark SQL query AVRO file

2015-08-07 Thread Michael Armbrust
Have you considered trying Spark SQL's native support for avro data?

https://github.com/databricks/spark-avro

On Fri, Aug 7, 2015 at 11:30 AM, java8964  wrote:

> Hi, Spark users:
>
> We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our
> production cluster, which has 42 data/task nodes.
>
> There is one dataset stored as Avro files, about 3T. Our business has a
> complex query running against the dataset, which is stored in a nested
> structure with Array of Struct in Avro and Hive.
>
> We can query it using Hive without any problem, but we like SparkSQL's
> performance, so we run the same query in Spark SQL, and found it is in fact
> much faster than Hive.
>
> But when we run it, we get the following error randomly from Spark
> executors, sometimes seriously enough to fail the whole Spark job.
>
> Below is the stack trace, and I think it is a bug related to Spark because:
>
> 1) The error shows up inconsistently, as sometimes we won't see it for this
> job. (We run it daily.)
> 2) Sometimes it won't fail our job, as it recovers after a retry.
> 3) Sometimes it will fail our job, as listed below.
> 4) Is this due to the multithreading in Spark? The NullPointerException
> indicates Hive got a null ObjectInspector among the children of a
> StructObjectInspector; as I read the Hive source code, I know there is no
> null ObjectInspector among the children of StructObjectInspector. Googling
> this error didn't give me any hint. Does anyone know anything about this?
>
> Project
> [HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
> StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL
> tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L,
> StringType),CAST(active_contact_count_a#16L,
> StringType),CAST(other_api_contact_count_a#6L,
> StringType),CAST(fb_api_contact_count_a#7L,
> StringType),CAST(evm_contact_count_a#8L,
> StringType),CAST(loyalty_contact_count_a#9L,
> StringType),CAST(mobile_jmml_contact_count_a#10L,
> StringType),CAST(savelocal_contact_count_a#11L,
> StringType),CAST(siteowner_contact_count_a#12L,
> StringType),CAST(socialcamp_service_contact_count_a#13L,
> S...org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 58 in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in
> stage 1.0 (TID 257, 10.20.95.146): java.lang.NullPointerException
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
> at
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.(AvroObjectInspectorGenerator.java:55)
> at
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
> at
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
> at
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
> at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
> at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   

Get bucket details created in shuffle phase

2015-08-07 Thread cheez
Hey all.
I was trying to understand Spark internals by looking into (and hacking)
the code.

I was trying to explore the buckets which are generated
when we partition the output of each map task and then let the reduce side
fetch them on the basis of partitionId. I went into the write() method of
SortShuffleWriter and there is an Iterator by the name of records passed in
to it as an argument. These key-value pairs are what I thought represented the
buckets. But upon exploring their contents I realized that I was wrong, because
pairs with the same keys were being shown in different buckets, which should not
have been the case.

I'd really appreciate it if someone could help me find where these buckets
originate.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Get-bucket-details-created-in-shuffle-phase-tp24175.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark SQL query AVRO file

2015-08-07 Thread java8964
Hi, Spark users:
We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production 
cluster, which has 42 data/task nodes.
There is one dataset stored as Avro files, about 3T. Our business has a complex
query running against the dataset, which is stored in a nested structure with Array of
Struct in Avro and Hive.
We can query it using Hive without any problem, but we like SparkSQL's
performance, so we run the same query in Spark SQL, and found it is in fact
much faster than Hive.
But when we run it, we get the following error randomly from Spark executors,
sometimes seriously enough to fail the whole Spark job.
Below is the stack trace, and I think it is a bug related to Spark because:
1) The error shows up inconsistently, as sometimes we won't see it for this job.
(We run it daily.)
2) Sometimes it won't fail our job, as it recovers after a retry.
3) Sometimes it will fail our job, as listed below.
4) Is this due to the multithreading in Spark? The NullPointerException indicates
Hive got a null ObjectInspector among the children of a StructObjectInspector; as
I read the Hive source code, I know there is no null ObjectInspector among the
children of StructObjectInspector. Googling this error didn't give me any hint.
Does anyone know anything about this?
Project 
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
 StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL 
tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L, 
StringType),CAST(active_contact_count_a#16L, 
StringType),CAST(other_api_contact_count_a#6L, 
StringType),CAST(fb_api_contact_count_a#7L, 
StringType),CAST(evm_contact_count_a#8L, 
StringType),CAST(loyalty_contact_count_a#9L, 
StringType),CAST(mobile_jmml_contact_count_a#10L, 
StringType),CAST(savelocal_contact_count_a#11L, 
StringType),CAST(siteowner_contact_count_a#12L, 
StringType),CAST(socialcamp_service_contact_count_a#13L, 
S...org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage 1.0 
(TID 257, 10.20.95.146): java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.(AvroObjectInspectorGenerator.java:55)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:198)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExe

Re: Spark job workflow engine recommendations

2015-08-07 Thread Ted Yu
In my opinion, choosing some particular project among its peers should
leave enough room for future growth (which may come faster than you
initially think).

Cheers

On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu  wrote:

> Scalability is a known issue due to the current architecture. However,
> this only becomes an issue if you run more than 20K jobs per day.
>
> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu  wrote:
>
>> From what I heard (an ex-coworker who is Oozie committer), Azkaban is
>> being phased out at LinkedIn because of scalability issues (though UI-wise,
>> Azkaban seems better).
>>
>> Vikram:
>> I suggest you do more research in related projects (maybe using their
>> mailing lists).
>>
>> Disclaimer: I don't work for LinkedIn.
>>
>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath > > wrote:
>>
>>> Hi Vikram,
>>>
>>> We use Azkaban (2.5.0) in our production workflow scheduling. We just
>>> use local mode deployment and it is fairly easy to set up. It is pretty
>>> easy to use and has a nice scheduling and logging interface, as well as
>>> SLAs (like kill job and notify if it doesn't complete in 3 hours or
>>> whatever).
>>>
>>> However Spark support is not present directly - we run everything with
>>> shell scripts and spark-submit. There is a plugin interface where one could
>>> create a Spark plugin, but I found it very cumbersome when I did
>>> investigate and didn't have the time to work through it to develop that.
>>>
>>> It has some quirks and while there is actually a REST API for adding jobs
>>> and dynamically scheduling jobs, it is not documented anywhere so you kinda
>>> have to figure it out for yourself. But in terms of ease of use I found it
>>> way better than Oozie. I haven't tried Chronos, and it seemed quite
>>> involved to set up. Haven't tried Luigi either.
>>>
>>> Spark job server is good but as you say lacks some stuff like scheduling
>>> and DAG type workflows (independent of spark-defined job flows).
>>>
>>>
>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke 
>>> wrote:
>>>
 Check also falcon in combination with oozie

 On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:

> Looks like Oozie can satisfy most of your requirements.
>
>
>
> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone 
> wrote:
>
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to
>> schedule spark jobs on a datastax cassandra cluster. Since there are 
>> tonnes
>> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I
>> wanted to check with people here to see what they are using today.
>>
>> Some of the requirements of the workflow engine that I'm looking for
>> are
>>
>> 1. First class support for submitting Spark jobs on Cassandra. Not
>> some wrapper Java code to submit tasks.
>> 2. Active open source community support and well tested at production
>> scale.
>> 3. Should be dead easy to write job dependencies using XML or a web
>> interface. Ex: job A depends on Job B and Job C, so run Job A after B 
>> and
>> C are finished. Don't need to write full blown java applications to 
>> specify
>> job parameters and dependencies. Should be very simple to use.
>> 4. Time based  recurrent scheduling. Run the spark jobs at a given
>> time every hour or day or week or month.
>> 5. Job monitoring, alerting on failures and email notifications on
>> daily basis.
>>
>> I have looked at Ooyala's spark job server which seems to be geared
>> towards making spark jobs run faster by sharing contexts between the jobs
>> but isn't a full blown workflow engine per se. A combination of spark job
>> server and workflow engine would be ideal
>>
>> Thanks for the inputs
>>
>
>
>>>
>>
>


Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Yep, I think that's what Gerald is saying and they are proposing to default
miniBatchFraction = (1 / numInstances). Is that correct?

On Fri, Aug 7, 2015 at 11:16 AM, Meihua Wu 
wrote:

> I think in the SGD algorithm, the mini batch sample is done without
> replacement. So with fraction=1, then all the rows will be sampled
> exactly once to form the miniBatch, resulting to the
> deterministic/classical case.
>
> On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang 
> wrote:
> > Sounds reasonable to me, feel free to create a JIRA (and PR if you're up
> for
> > it) so we can see what others think!
> >
> > On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler
> >  wrote:
> >>
> >> hi,
> >>
> >> if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
> >> doesn’t that make it a deterministic/classical gradient descent rather
> >> than a SGD?
> >>
> >> Specifically, miniBatchFraction=1.0 means the entire data set, i.e.
> >> all rows. In the spirit of SGD, shouldn’t the default be the fraction
> >> that results in exactly one row of the data set?
> >>
> >> thank you
> >> gerald
> >>
> >> --
> >> Gerald Loeffler
> >> mailto:gerald.loeff...@googlemail.com
> >> http://www.gerald-loeffler.net
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


Re: Spark job workflow engine recommendations

2015-08-07 Thread Hien Luu
Scalability is a known issue due to the current architecture. However,
this only becomes an issue if you run more than 20K jobs per day.

On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu  wrote:

> From what I heard (an ex-coworker who is Oozie committer), Azkaban is
> being phased out at LinkedIn because of scalability issues (though UI-wise,
> Azkaban seems better).
>
> Vikram:
> I suggest you do more research in related projects (maybe using their
> mailing lists).
>
> Disclaimer: I don't work for LinkedIn.
>
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath 
> wrote:
>
>> Hi Vikram,
>>
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
>> local mode deployment and it is fairly easy to set up. It is pretty easy to
>> use and has a nice scheduling and logging interface, as well as SLAs (like
>> kill job and notify if it doesn't complete in 3 hours or whatever).
>>
>> However Spark support is not present directly - we run everything with
>> shell scripts and spark-submit. There is a plugin interface where one could
>> create a Spark plugin, but I found it very cumbersome when I did
>> investigate and didn't have the time to work through it to develop that.
>>
>> It has some quirks and while there is actually a REST API for adding jobs
>> and dynamically scheduling jobs, it is not documented anywhere so you kinda
>> have to figure it out for yourself. But in terms of ease of use I found it
>> way better than Oozie. I haven't tried Chronos, and it seemed quite
>> involved to set up. Haven't tried Luigi either.
>>
>> Spark job server is good but as you say lacks some stuff like scheduling
>> and DAG type workflows (independent of spark-defined job flows).
>>
>>
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke  wrote:
>>
>>> Check also falcon in combination with oozie
>>>
>>> On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:
>>>
 Looks like Oozie can satisfy most of your requirements.



 On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone 
 wrote:

> Hi,
> I'm looking for open source workflow tools/engines that allow us to
> schedule spark jobs on a datastax cassandra cluster. Since there are 
> tonnes
> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I
> wanted to check with people here to see what they are using today.
>
> Some of the requirements of the workflow engine that I'm looking for
> are
>
> 1. First class support for submitting Spark jobs on Cassandra. Not
> some wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production
> scale.
> 3. Should be dead easy to write job dependencies using XML or a web
> interface. Ex: job A depends on Job B and Job C, so run Job A after B and
> C are finished. Don't need to write full blown java applications to 
> specify
> job parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given
> time every hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on
> daily basis.
>
> I have looked at Ooyala's spark job server which seems to be geared
> towards making spark jobs run faster by sharing contexts between the jobs
> but isn't a full blown workflow engine per se. A combination of spark job
> server and workflow engine would be ideal
>
> Thanks for the inputs
>


>>
>


Re: Amazon DynamoDB & Spark

2015-08-07 Thread Yasemin Kaya
Thanx Jay.

2015-08-07 19:25 GMT+03:00 Jay Vyas :

> In general the simplest way is to use the DynamoDB Java API as is
> and call it inside a map(), using the asynchronous put() DynamoDB API call.
>
>
> > On Aug 7, 2015, at 9:08 AM, Yasemin Kaya  wrote:
> >
> > Hi,
> >
> > Is there a way to use DynamoDB in a Spark application? I have to persist my
> results to DynamoDB.
> >
> > Thanx,
> > yasemin
> >
> > --
> > hiç ender hiç
>
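
For reference, a minimal sketch of the pattern Jay describes, using
foreachPartition with a synchronous put (a simpler variant of the async
approach; the table name, attribute names and the RDD's element type are
placeholders, and it assumes the AWS SDK for Java is on the executor classpath):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
import scala.collection.JavaConverters._

// results: RDD[(String, String)] of (id, value) pairs
results.foreachPartition { partition =>
  // one client per partition, created on the executor
  val client = new AmazonDynamoDBClient()
  partition.foreach { case (id, value) =>
    val item = Map(
      "id"    -> new AttributeValue().withS(id),
      "value" -> new AttributeValue().withS(value)
    ).asJava
    client.putItem(new PutItemRequest().withTableName("my-results-table").withItem(item))
  }
}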



-- 
hiç ender hiç


Fwd: [Spark + Hive + EMR + S3] Issue when reading from Hive external table backed on S3 with large amount of small files

2015-08-07 Thread Roberto Coluccio
Please community, I'd really appreciate your opinion on this topic.

Best regards,
Roberto


-- Forwarded message --
From: Roberto Coluccio 
Date: Sat, Jul 25, 2015 at 6:28 PM
Subject: [Spark + Hive + EMR + S3] Issue when reading from Hive external
table backed on S3 with large amount of small files
To: user@spark.apache.org


Hello Spark community,

I currently have a Spark 1.3.1 batch driver, deployed in YARN-cluster mode
on an EMR cluster (AMI 3.7.0) that reads input data through a HiveContext,
in particular SELECTing data from an EXTERNAL TABLE backed on S3. Such a
table has dynamic partitions and contains *hundreds of small GZip files*.
Since it is currently unfeasible to collate such files on the source
side, I observe that, by default, the SELECT query is mapped by Spark
into as many tasks as there are files in the table root
path (+partitions), e.g. 860 files === 860 tasks to complete the Spark stage
of that read operation.

This behaviour obviously creates an incredible overhead and often results in
failed stages due to OOM exceptions and subsequent crashes of the
executors. Regardless of the size of the input that I can manage to handle, I
would really appreciate it if you could suggest how to somehow collate the
input partitions while reading, or, at least, reduce the number of tasks
spawned by the Hive query.

Looking at
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-differences.html#emr-hive-gzip-splits
I tried setting:

hiveContext.sql("set hive.hadoop.supports.splittable.combineinputformat=true")


before creating the external table to read from and query it, but it
resulted in NO changes. Tried also to set that in the hive-site.xml on the
cluster, but I experienced the same behaviour.
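
For illustration, one workaround I am considering is to coalesce the partitions
right after the read, so that each task handles several of the small files (the
query and partition count below are placeholders, this is untested on this
exact setup, and it only helps for a plain scan/filter query):

val rows = hiveContext.sql("SELECT ... FROM my_external_table WHERE ...")
  .rdd
  .coalesce(100, shuffle = false)

This does not change how many splits Hive reports, but it should reduce the
number of tasks that actually run.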

Thanks to whoever can give me any hints.

Best regards,
Roberto


Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Meihua Wu
I think in the SGD algorithm, the mini batch sample is done without
replacement. So with fraction=1, then all the rows will be sampled
exactly once to form the miniBatch, resulting to the
deterministic/classical case.

On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang  wrote:
> Sounds reasonable to me, feel free to create a JIRA (and PR if you're up for
> it) so we can see what others think!
>
> On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler
>  wrote:
>>
>> hi,
>>
>> if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
>> doesn’t that make it a deterministic/classical gradient descent rather
>> than a SGD?
>>
>> Specifically, miniBatchFraction=1.0 means the entire data set, i.e.
>> all rows. In the spirit of SGD, shouldn’t the default be the fraction
>> that results in exactly one row of the data set?
>>
>> thank you
>> gerald
>>
>> --
>> Gerald Loeffler
>> mailto:gerald.loeff...@googlemail.com
>> http://www.gerald-loeffler.net
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: tachyon

2015-08-07 Thread Calvin Jia
Hi Abhishek,

Here's a production use case that may interest you:
http://www.meetup.com/Tachyon/events/222485713/


Baidu is using Tachyon  to manage more than 100
nodes in production resulting in a 30x performance improvement for their
SparkSQL workload. They are also using the tiered storage feature in
Tachyon giving them over 2PB of Tachyon managed space.

Hope this helps,
Calvin

On Fri, Aug 7, 2015 at 10:00 AM, Ted Yu  wrote:

> Looks like you would get better response on Tachyon's mailing list:
>
> https://groups.google.com/forum/?fromgroups#!forum/tachyon-users
>
> Cheers
>
> On Fri, Aug 7, 2015 at 9:56 AM, Abhishek R. Singh <
> abhis...@tetrationanalytics.com> wrote:
>
>> Do people use Tachyon in production, or is it experimental grade still?
>>
>> Regards,
>> Abhishek
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Igor Berman
check on which ip/port master listens
netstat -a -t --numeric-ports


On 7 August 2015 at 20:48, Jeff Jones  wrote:

> Thanks. Added this to both the client and the master but still not getting
> any more information. I confirmed the flag with ps.
>
>
>
> jjones53222  2.7  0.1 19399412 549656 pts/3 Sl   17:17   0:44
> /opt/jdk1.8/bin/java -cp
> /home/jjones/bin/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
> -Dsun.io.serialization.extendedDebugInfo=true -Xms512m -Xmx512m
> org.apache.spark.deploy.master.Master --ip p3.ourdomain.com --port 7077
> --webui-port 8080’
>
>
>
> Error message(s) the same:
>
>
>
> 15/08/07 17:23:26 ERROR Remoting: org.apache.spark.deploy.Command; local
> class incompatible: stream classdesc serialVersionUID =
> -7098307370860582211, local class serialVersionUID = -3335312719467547622
>
> java.io.InvalidClassException: org.apache.spark.deploy.Command; local
> class incompatible: stream classdesc serialVersionUID =
> -7098307370860582211, local class serialVersionUID = -3335312719467547622
>
> at
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:621)
>
> at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
>
> at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>
> at
> akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
>
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>
> at
> akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
>
> at
> akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
>
> at scala.util.Try$.apply(Try.scala:161)
>
> at
> akka.serialization.Serialization.deserialize(Serialization.scala:98)
>
> at
> akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:63)
>
> at
> akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
>
> at scala.util.Try$.apply(Try.scala:161)
>
> at
> akka.serialization.Serialization.deserialize(Serialization.scala:98)
>
> at
> akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
>
> at
> akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
>
> at
> akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
>
> at
> akka.remote.DefaultMessageDispatcher.payloadClass$1(Endpoint.scala:59)
>
> at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:99)
>
> at
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
>
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>
> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
>
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>
>
>
> *From:* Sonal Goyal [mailto:sonalgoy...@gmail.com]
> *Sent:* Thursday, August 6, 2015 11:22 PM
> *To:* Jeff Jones
> *Cc:* user@spark.apache.org
> *Subject:* Re: All masters are unresponsive! Giving up.
>
>
>
> There seems  to be a version mismatch somewhere. You can try and find out
> the cause with debug serialization information. I think the jvm flag
> -Dsun.io.*serialization*.*extendedDebugInfo*=true should help.
>
>
> Best Regards,
> Sonal
> Founder, Nube Technologies 
>
> Check out Reifier at Spa

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Ted Yu
Spark 1.4.1 depends on akka version:
2.3.4-spark

Is it possible that your standalone cluster has another version of akka ?

Cheers

On Fri, Aug 7, 2015 at 10:48 AM, Jeff Jones 
wrote:

> Thanks. Added this to both the client and the master but still not getting
> any more information. I confirmed the flag with ps.
>
>
>
> jjones53222  2.7  0.1 19399412 549656 pts/3 Sl   17:17   0:44
> /opt/jdk1.8/bin/java -cp
> /home/jjones/bin/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
> -Dsun.io.serialization.extendedDebugInfo=true -Xms512m -Xmx512m
> org.apache.spark.deploy.master.Master --ip p3.ourdomain.com --port 7077
> --webui-port 8080’
>
>
>
> Error message(s) the same:
>
>
>
> 15/08/07 17:23:26 ERROR Remoting: org.apache.spark.deploy.Command; local
> class incompatible: stream classdesc serialVersionUID =
> -7098307370860582211, local class serialVersionUID = -3335312719467547622
>
> java.io.InvalidClassException: org.apache.spark.deploy.Command; local
> class incompatible: stream classdesc serialVersionUID =
> -7098307370860582211, local class serialVersionUID = -3335312719467547622
>
> at
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:621)
>
> at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
>
> at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>
> at
> akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
>
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>
> at
> akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
>
> at
> akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
>
> at scala.util.Try$.apply(Try.scala:161)
>
> at
> akka.serialization.Serialization.deserialize(Serialization.scala:98)
>
> at
> akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:63)
>
> at
> akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
>
> at scala.util.Try$.apply(Try.scala:161)
>
> at
> akka.serialization.Serialization.deserialize(Serialization.scala:98)
>
> at
> akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
>
> at
> akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
>
> at
> akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
>
> at
> akka.remote.DefaultMessageDispatcher.payloadClass$1(Endpoint.scala:59)
>
> at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:99)
>
> at
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
>
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>
> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
>
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>
>
>
> *From:* Sonal Goyal [mailto:sonalgoy...@gmail.com]
> *Sent:* Thursday, August 6, 2015 11:22 PM
> *To:* Jeff Jones
> *Cc:* user@spark.apache.org
> *Subject:* Re: All masters are unresponsive! Giving up.
>
>
>
> There seems  to be a version mismatch somewhere. You can try and find out
> the cause with debug serialization information. I think the jvm flag
> -Dsun.io.*serialization*.*extendedDebugInfo*=true should help.
>
>
> Best Regards,
> Sonal
> Founder, Nube Techno

Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Oh ok. That's a good enough reason against azkaban then. So looks like
Oozie is the best choice here.

On Friday, August 7, 2015, Ted Yu  wrote:

> From what I heard (an ex-coworker who is Oozie committer), Azkaban is
> being phased out at LinkedIn because of scalability issues (though UI-wise,
> Azkaban seems better).
>
> Vikram:
> I suggest you do more research in related projects (maybe using their
> mailing lists).
>
> Disclaimer: I don't work for LinkedIn.
>
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath  > wrote:
>
>> Hi Vikram,
>>
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
>> local mode deployment and it is fairly easy to set up. It is pretty easy to
>> use and has a nice scheduling and logging interface, as well as SLAs (like
>> kill job and notify if it doesn't complete in 3 hours or whatever).
>>
>> However Spark support is not present directly - we run everything with
>> shell scripts and spark-submit. There is a plugin interface where one could
>> create a Spark plugin, but I found it very cumbersome when I did
>> investigate and didn't have the time to work through it to develop that.
>>
>> It has some quirks and while there is actually a REST API for adding jobs
>> and dynamically scheduling jobs, it is not documented anywhere so you kinda
>> have to figure it out for yourself. But in terms of ease of use I found it
>> way better than Oozie. I haven't tried Chronos, and it seemed quite
>> involved to set up. Haven't tried Luigi either.
>>
>> Spark job server is good but as you say lacks some stuff like scheduling
>> and DAG type workflows (independent of spark-defined job flows).
>>
>>
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke > > wrote:
>>
>>> Check also falcon in combination with oozie
>>>
>>> On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:
>>>
 Looks like Oozie can satisfy most of your requirements.



 On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone >>> > wrote:

> Hi,
> I'm looking for open source workflow tools/engines that allow us to
> schedule spark jobs on a datastax cassandra cluster. Since there are 
> tonnes
> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
> wanted to check with people here to see what they are using today.
>
> Some of the requirements of the workflow engine that I'm looking for
> are
>
> 1. First class support for submitting Spark jobs on Cassandra. Not
> some wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production
> scale.
> 3. Should be dead easy to write job dependencies using XML or web
> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
> C are finished. Don't need to write full blown java applications to 
> specify
> job parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given
> time every hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on
> daily basis.
>
> I have looked at Ooyala's spark job server which seems to be geared
> towards making spark jobs run faster by sharing contexts between the jobs
> but isn't a full blown workflow engine per se. A combination of spark job
> server and workflow engine would be ideal
>
> Thanks for the inputs
>


>>
>


[Spark Streaming] Session based windowing like in google dataflow

2015-08-07 Thread Ankur Chauhan
Hi all,

I am trying to figure out how to perform the equivalent of "Session windows" (as
mentioned in https://cloud.google.com/dataflow/model/windowing) using Spark
Streaming. Is it even possible (i.e. possible to do efficiently at scale)? Just
to expand on the definition:

Taken from the google dataflow documentation:

The simplest kind of session windowing specifies a minimum gap duration. All 
data arriving below a minimum threshold of time delay is grouped into the same 
window. If data arrives after the minimum specified gap duration time, this 
initiates the start of a new window.
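
One direction I have been considering is keying events by user and tracking a per-key gap
with updateStateByKey. A rough sketch of what I mean (all names, types and the 30-minute
gap are placeholders, `events` is a DStream[(userId, (payload, eventTimeMs))], and
ssc.checkpoint(...) must be set for stateful operations):

  case class Session(events: Seq[String], lastSeen: Long)

  val sessionGapMs = 30 * 60 * 1000L

  def updateSession(newEvents: Seq[(String, Long)],
                    state: Option[Session]): Option[Session] = {
    val now = System.currentTimeMillis()
    state match {
      case Some(s) if newEvents.isEmpty && now - s.lastSeen > sessionGapMs =>
        None  // gap exceeded with no new data: close (drop) the session
      case _ =>
        val current = state.getOrElse(Session(Seq.empty, now))
        val lastSeen = (current.lastSeen +: newEvents.map(_._2)).max
        Some(Session(current.events ++ newEvents.map(_._1), lastSeen))
    }
  }

  val sessions = events.updateStateByKey(updateSession _)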




Any help would be appreciated.

-- Ankur Chauhan




RE: All masters are unresponsive! Giving up.

2015-08-07 Thread Jeff Jones
Thanks. Added this to both the client and the master but still not getting any 
more information. I confirmed the flag with ps.

jjones53222  2.7  0.1 19399412 549656 pts/3 Sl   17:17   0:44 
/opt/jdk1.8/bin/java -cp 
/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/jjones/bin/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
 -Dsun.io.serialization.extendedDebugInfo=true -Xms512m -Xmx512m 
org.apache.spark.deploy.master.Master --ip p3.ourdomain.com --port 7077 
--webui-port 8080’

Error message(s) the same:

15/08/07 17:23:26 ERROR Remoting: org.apache.spark.deploy.Command; local class 
incompatible: stream classdesc serialVersionUID = -7098307370860582211, local 
class serialVersionUID = -3335312719467547622
java.io.InvalidClassException: org.apache.spark.deploy.Command; local class 
incompatible: stream classdesc serialVersionUID = -7098307370860582211, local 
class serialVersionUID = -3335312719467547622
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:621)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
at 
akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.deserialize(Serialization.scala:98)
at 
akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:63)
at 
akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.deserialize(Serialization.scala:98)
at 
akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
at 
akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
at 
akka.remote.DefaultMessageDispatcher.payloadClass$1(Endpoint.scala:59)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:99)
at 
akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)

From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: Thursday, August 6, 2015 11:22 PM
To: Jeff Jones
Cc: user@spark.apache.org
Subject: Re: All masters are unresponsive! Giving up.

There seems  to be a version mismatch somewhere. You can try and find out the 
cause with debug serialization information. I think the jvm flag 
-Dsun.io.serialization.extendedDebugInfo=true should help.

Best Regards,
Sonal
Founder, Nube Technologies
Check out Reifier at Spark Summit 
2015




On Fri, Aug 7, 2015 at 4:42 AM, Jeff Jones 
mailto:jjo...@adaptivebiotech.com>> wrote:
I wrote a very simple Spark 1.4.1 app that I can run through a local driver 
program just fine using setMaster(“local[*]”).  The app is as follows:

import org.apache.spark.SparkContext
impo

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Eugene Morozov
Would like to add something, inline.

On 07 Aug 2015, at 18:51, Eugene Morozov  wrote:

> Hao, 
> 
> I’d say there are few possible ways to achieve that:
> 1. Use KryoSerializer.
>   The flaw of KryoSerializer is that current version (2.21) has an issue with 
> internal state and it might not work for some objects. Spark gets its kryo 
> dependency transitively through chill, so it’ll not be resolved quickly. 
> Kryo doesn’t work for me (there are classes I have to transfer, but I do 
> not have their codebase).
> 
> 2. Wrap it into something you have control and make that something 
> serializable.
>   The flaw is kind of obvious - it’s really hard to write serialization for 
> complex objects.
> 
> 3. Tricky algo: don’t do anything that might end up as reshuffle.
>   That’s the way I took. The flow is that we have a CSV file as input, parse it 
> and create objects that we cannot serialize / deserialize, thus cannot 
> transfer over the network. Currently we’ve worked around it so that these 
> objects are processed only in those partitions where they’ve been born.

That means it’s not possible, while debugging, to call collect() on our RDD (even though 
the spark master is local), but there is always a way to see what’s inside.
The algo is pretty complex; we still do reshuffles and everything, but only before we 
finally create those objects.

> 
> Hope, this helps.
> 
> On 07 Aug 2015, at 12:39, Hao Ren  wrote:
> 
>> Is there any workaround to distribute non-serializable object for RDD 
>> transformation or broadcast variable ?
>> 
>> Say I have an object of class C which is not serializable. Class C is in a 
>> jar package, I have no control on it. Now I need to distribute it either by 
>> rdd transformation or by broadcast. 
>> 
>> I tried to subclass the class C with Serializable interface. It works for 
>> serialization, but deserialization does not work, since there are no 
>> parameter-less constructor for the class C and deserialization is broken 
>> with an invalid constructor exception.
>> 
>> I think it's a common use case. Any help is appreciated.
>> 
>> -- 
>> Hao Ren
>> 
>> Data Engineer @ leboncoin
>> 
>> Paris, France
> 
> Eugene Morozov
> fathers...@list.ru
> 
> 
> 
> 

Eugene Morozov
fathers...@list.ru






SparkSQL: remove jar added by "add jar " command from dependencies

2015-08-07 Thread Wu, James C.
Hi,

I am using Spark SQL to run some queries on a set of avro data. Somehow I am 
getting this error

0: jdbc:hive2://n7-z01-0a2a1453> select count(*) from flume_test;

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
3 in stage 26.0 failed 4 times, most recent failure: Lost task 3.3 in stage 
26.0 (TID 1027, n7-z01-0a2a1457.iaas.starwave.com): java.io.IOException: 
Incomplete HDFS URI, no host: hdfs:data/hive-jars/avro-mapred.jar

at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:141)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)

at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1364)

at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:498)

at org.apache.spark.util.Utils$.fetchFile(Utils.scala:383)

at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:350)

at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:347)

at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)

at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)

at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:347)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)


I did not add the jar in this session, so I am wondering how I can get the jar 
removed from the dependencies so that it is not blocking all my Spark SQL 
queries for all sessions.

Thanks,

James


Re: Spark job workflow engine recommendations

2015-08-07 Thread Ted Yu
From what I heard (an ex-coworker who is an Oozie committer), Azkaban is being
phased out at LinkedIn because of scalability issues (though UI-wise,
Azkaban seems better).

Vikram:
I suggest you do more research in related projects (maybe using their
mailing lists).

Disclaimer: I don't work for LinkedIn.

On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath 
wrote:

> Hi Vikram,
>
> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
> local mode deployment and it is fairly easy to set up. It is pretty easy to
> use and has a nice scheduling and logging interface, as well as SLAs (like
> kill job and notify if it doesn't complete in 3 hours or whatever).
>
> However Spark support is not present directly - we run everything with
> shell scripts and spark-submit. There is a plugin interface where one could
> create a Spark plugin, but I found it very cumbersome when I did
> investigate and didn't have the time to work through it to develop that.
>
> It has some quirks and while there is actually a REST API for adding jobs
> and dynamically scheduling jobs, it is not documented anywhere so you kinda
> have to figure it out for yourself. But in terms of ease of use I found it
> way better than Oozie. I haven't tried Chronos, and it seemed quite
> involved to set up. Haven't tried Luigi either.
>
> Spark job server is good but as you say lacks some stuff like scheduling
> and DAG type workflows (independent of spark-defined job flows).
>
>
> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke  wrote:
>
>> Check also falcon in combination with oozie
>>
>> On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:
>>
>>> Looks like Oozie can satisfy most of your requirements.
>>>
>>>
>>>
>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone 
>>> wrote:
>>>
 Hi,
 I'm looking for open source workflow tools/engines that allow us to
 schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
 of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
 wanted to check with people here to see what they are using today.

 Some of the requirements of the workflow engine that I'm looking for are

 1. First class support for submitting Spark jobs on Cassandra. Not some
 wrapper Java code to submit tasks.
 2. Active open source community support and well tested at production
 scale.
 3. Should be dead easy to write job dependencies using XML or web
 interface . Ex; job A depends on Job B and Job C, so run Job A after B and
 C are finished. Don't need to write full blown java applications to specify
 job parameters and dependencies. Should be very simple to use.
 4. Time based  recurrent scheduling. Run the spark jobs at a given time
 every hour or day or week or month.
 5. Job monitoring, alerting on failures and email notifications on
 daily basis.

 I have looked at Ooyala's spark job server which seems to be geared
 towards making spark jobs run faster by sharing contexts between the jobs
 but isn't a full blown workflow engine per se. A combination of spark job
 server and workflow engine would be ideal

 Thanks for the inputs

>>>
>>>
>


Re: Spark job workflow engine recommendations

2015-08-07 Thread Nick Pentreath
Hi Vikram,

We use Azkaban (2.5.0) in our production workflow scheduling. We just use
local mode deployment and it is fairly easy to set up. It is pretty easy to
use and has a nice scheduling and logging interface, as well as SLAs (like
kill job and notify if it doesn't complete in 3 hours or whatever).

However Spark support is not present directly - we run everything with
shell scripts and spark-submit. There is a plugin interface where one could
create a Spark plugin, but I found it very cumbersome when I did
investigate and didn't have the time to work through it to develop that.

It has some quirks and while there is actually a REST API for adding jobs
and dynamically scheduling jobs, it is not documented anywhere so you kinda
have to figure it out for yourself. But in terms of ease of use I found it
way better than Oozie. I haven't tried Chronos, and it seemed quite
involved to set up. Haven't tried Luigi either.

Spark job server is good but as you say lacks some stuff like scheduling
and DAG type workflows (independent of spark-defined job flows).


On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke  wrote:

> Check also falcon in combination with oozie
>
> On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:
>
>> Looks like Oozie can satisfy most of your requirements.
>>
>>
>>
>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone  wrote:
>>
>>> Hi,
>>> I'm looking for open source workflow tools/engines that allow us to
>>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>>> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
>>> wanted to check with people here to see what they are using today.
>>>
>>> Some of the requirements of the workflow engine that I'm looking for are
>>>
>>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>>> wrapper Java code to submit tasks.
>>> 2. Active open source community support and well tested at production
>>> scale.
>>> 3. Should be dead easy to write job dependencies using XML or web
>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>>> C are finished. Don't need to write full blown java applications to specify
>>> job parameters and dependencies. Should be very simple to use.
>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
>>> every hour or day or week or month.
>>> 5. Job monitoring, alerting on failures and email notifications on daily
>>> basis.
>>>
>>> I have looked at Ooyala's spark job server which seems to be geared
>>> towards making spark jobs run faster by sharing contexts between the jobs
>>> but isn't a full blown workflow engine per se. A combination of spark job
>>> server and workflow engine would be ideal
>>>
>>> Thanks for the inputs
>>>
>>
>>


Re: tachyon

2015-08-07 Thread Ted Yu
Looks like you would get better response on Tachyon's mailing list:

https://groups.google.com/forum/?fromgroups#!forum/tachyon-users

Cheers

On Fri, Aug 7, 2015 at 9:56 AM, Abhishek R. Singh <
abhis...@tetrationanalytics.com> wrote:

> Do people use Tachyon in production, or is it experimental grade still?
>
> Regards,
> Abhishek
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark job workflow engine recommendations

2015-08-07 Thread Jörn Franke
Check also falcon in combination with oozie

On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:

> Looks like Oozie can satisfy most of your requirements.
>
>
>
> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone  wrote:
>
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to
>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
>> wanted to check with people here to see what they are using today.
>>
>> Some of the requirements of the workflow engine that I'm looking for are
>>
>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>> wrapper Java code to submit tasks.
>> 2. Active open source community support and well tested at production
>> scale.
>> 3. Should be dead easy to write job dependencies using XML or web
>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>> C are finished. Don't need to write full blown java applications to specify
>> job parameters and dependencies. Should be very simple to use.
>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
>> every hour or day or week or month.
>> 5. Job monitoring, alerting on failures and email notifications on daily
>> basis.
>>
>> I have looked at Ooyala's spark job server which seems to be geared
>> towards making spark jobs run faster by sharing contexts between the jobs
>> but isn't a full blown workflow engine per se. A combination of spark job
>> server and workflow engine would be ideal
>>
>> Thanks for the inputs
>>
>
>


tachyon

2015-08-07 Thread Abhishek R. Singh
Do people use Tachyon in production, or is it experimental grade still?

Regards,
Abhishek

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to run start-thrift-server in debug mode?

2015-08-07 Thread Benjamin Ross
Hi,
I'm trying to run the hive thrift server in debug mode.  I've tried to simply 
pass -Xdebug 
-Xrunjdwp:transport=dt_socket,address=127.0.0.1:,server=y,suspend=n to 
start-thriftserver.sh as a driver option, but it doesn't seem to host a server. 
 I've then tried to edit the various shell scripts to run hive thrift server 
but couldn't get things to work.  It seems that there must be an easier way to 
do this.  I've also tried to run it directly in eclipse, but ran into issues 
related to Scala that I haven't quite yet figured out.

start-thriftserver.sh --driver-java-options 
"-agentlib:jdwp=transport=dt_socket,address=localhost:8000,server=y,suspend=n 
-XX:MaxPermSize=512"  --master yarn://localhost:9000 --num-executors 2


jdb -attach localhost:8000
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at 
com.sun.tools.jdi.SocketTransportService.attach(SocketTransportService.java:222)
at 
com.sun.tools.jdi.GenericAttachingConnector.attach(GenericAttachingConnector.java:116)
at 
com.sun.tools.jdi.SocketAttachingConnector.attach(SocketAttachingConnector.java:90)
at 
com.sun.tools.example.debug.tty.VMConnection.attachTarget(VMConnection.java:519)
at 
com.sun.tools.example.debug.tty.VMConnection.open(VMConnection.java:328)
at com.sun.tools.example.debug.tty.Env.init(Env.java:63)
at com.sun.tools.example.debug.tty.TTY.main(TTY.java:1066)

Let me know if I'm missing something here...
Thanks in advance,
Ben


Re: Spark MLib v/s SparkR

2015-08-07 Thread Feynman Liang
SparkR and MLlib are becoming more integrated (we recently added R formula
support) but the integration is still quite small. If you learn R and
SparkR, you will not be able to leverage most of the distributed algorithms
in MLlib (e.g. all the algorithms you cited). However, you could use the
equivalent R implementations (e.g. glm for Logistic) but be aware that
these will not scale to the large scale datasets Spark is designed to
handle.
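
For reference, a minimal sketch of the scalable MLlib (Scala) counterpart; the `training`
RDD[LabeledPoint] is assumed to exist and is not taken from this thread:

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

  val model = new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .run(training)

  // Simple training-set accuracy check.
  val accuracy = training.map(p => (model.predict(p.features), p.label))
    .filter { case (prediction, label) => prediction == label }
    .count().toDouble / training.count()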

On Thu, Aug 6, 2015 at 8:06 PM, praveen S  wrote:

> I am starting off with classification models, Logistic,RandomForest.
> Basically wanted to learn Machine learning.
> Since I have a java background I started off with MLib, but later heard R
> works as well ( with scaling issues - only).
>
> So, with SparkR was wondering the scaling issue would be resolved - hence
> my question why not go with R and Spark R alone.( keeping aside my
> inclination towards java)
>
> On Thu, Aug 6, 2015 at 12:28 AM, Charles Earl 
> wrote:
>
>> What machine learning algorithms are you interested in exploring or
>> using? Start from there or better yet the problem you are trying to solve,
>> and then the selection may be evident.
>>
>>
>> On Wednesday, August 5, 2015, praveen S  wrote:
>>
>>> I was wondering when one should go for MLib or SparkR. What is the
>>> criteria or what should be considered before choosing either of the
>>> solutions for data analysis?
>>> or What is the advantages of Spark MLib over Spark R or advantages of
>>> SparkR over MLib?
>>>
>>
>>
>> --
>> - Charles
>>
>
>


Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Muler
Spark is an in-memory engine and attempts to do computation in-memory.
Tachyon is memory-centric distributed storage, OK, but how would that help
run Spark faster?


Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Corey Nolet
1) Spark only needs to shuffle when data needs to be partitioned around the
workers in an all-to-all fashion.
2) Multi-stage jobs that would normally require several MapReduce jobs (and
thus cause data to be dumped to disk between the jobs) can keep their
intermediate results cached in memory.
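
An illustrative sketch of point 2 (the input path is a placeholder): one shuffle, then the
cached result feeds two follow-up jobs without re-reading the input.

  val counts = sc.textFile("hdfs:///input/words")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)   // the only all-to-all shuffle here
    .cache()              // kept in memory instead of being written out between jobs

  val topWords = counts.sortBy(_._2, ascending = false).take(20)  // reuses the cached counts
  val vocabularySize = counts.count()                             // no recomputation from the raw input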


Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Hien Luu
This blog outlines a few things that make Spark faster than MapReduce -
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

On Fri, Aug 7, 2015 at 9:13 AM, Muler  wrote:

> Consider the classic word count application over a 4 node cluster with a
> sizable working data. What makes Spark run faster than MapReduce
> considering that Spark also has to write to disk during shuffle?
>


Re: Amazon DynamoDB & Spark

2015-08-07 Thread Jay Vyas
In general the simplest way is to use the Dynamo Java API as is, call it
inside a map(), and use the asynchronous put() Dynamo API call.
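
A minimal sketch of that pattern, assuming the AWS SDK v1 DynamoDB classes are on the
classpath; the table name, attribute names and `resultRDD: RDD[(String, String)]` are made
up for illustration:

  import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
  import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
  import scala.collection.JavaConverters._

  resultRDD.foreachPartition { rows =>
    // One client per partition, created on the executor, so nothing non-serializable
    // is shipped from the driver.
    val client = new AmazonDynamoDBClient()
    rows.foreach { case (id, value) =>
      val item = Map(
        "id"    -> new AttributeValue().withS(id),
        "value" -> new AttributeValue().withS(value)
      ).asJava
      client.putItem(new PutItemRequest().withTableName("results").withItem(item))
    }
  }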


> On Aug 7, 2015, at 9:08 AM, Yasemin Kaya  wrote:
> 
> Hi,
> 
> Is there a way using DynamoDB in spark application? I have to persist my 
> results to DynamoDB.
> 
> Thanx,
> yasemin
> 
> -- 
> hiç ender hiç

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Muler
Consider the classic word count application over a 4 node cluster with a
sizable working data. What makes Spark run faster than MapReduce
considering that Spark also has to write to disk during shuffle?


Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Sounds reasonable to me, feel free to create a JIRA (and PR if you're up
for it) so we can see what others think!

On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler <
gerald.loeff...@googlemail.com> wrote:

> hi,
>
> if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
> doesn’t that make it a deterministic/classical gradient descent rather
> than a SGD?
>
> Specifically, miniBatchFraction=1.0 means the entire data set, i.e.
> all rows. In the spirit of SGD, shouldn’t the default be the fraction
> that results in exactly one row of the data set?
>
> thank you
> gerald
>
> --
> Gerald Loeffler
> mailto:gerald.loeff...@googlemail.com
> http://www.gerald-loeffler.net
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: distributing large matrices

2015-08-07 Thread Koen Vantomme


Sent from my Sony Xperia™ smartphone

 iceback wrote 

>Is this the sort of problem spark can accommodate?
>
>I need to compare 10,000 matrices with each other (10^10 comparison).  The
>matrices are 100x10 (10^7 int values).  
>I have 10 machines with 2 to 8 cores (8-32 "processors").
>All machines have to
> - contribute to matrices generation (a simulation, takes seconds)
> - see all matrices
> - compare matrices (takes very little time compared to simulation)
>
>I expect to persist the simulations, have spark push them to processors.
>
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/distributing-large-matrices-tp24174.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Thanks for the suggestion Hien. I'm curious why not azkaban from linkedin.
From what I read online, Oozie was very cumbersome to set up and use compared
to Azkaban. Since you are from LinkedIn, I wanted to get some perspective on
what it lacks compared to Oozie. Ease of use is very important more than
full feature set

On Friday, August 7, 2015, Hien Luu  wrote:

> Looks like Oozie can satisfy most of your requirements.
>
>
>
> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone  > wrote:
>
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to
>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
>> wanted to check with people here to see what they are using today.
>>
>> Some of the requirements of the workflow engine that I'm looking for are
>>
>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>> wrapper Java code to submit tasks.
>> 2. Active open source community support and well tested at production
>> scale.
>> 3. Should be dead easy to write job dependencies using XML or web
>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>> C are finished. Don't need to write full blown java applications to specify
>> job parameters and dependencies. Should be very simple to use.
>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
>> every hour or day or week or month.
>> 5. Job monitoring, alerting on failures and email notifications on daily
>> basis.
>>
>> I have looked at Ooyala's spark job server which seems to be geared
>> towards making spark jobs run faster by sharing contexts between the jobs
>> but isn't a full blown workflow engine per se. A combination of spark job
>> server and workflow engine would be ideal
>>
>> Thanks for the inputs
>>
>
>


RE: Issue when rebroadcasting a variable outside of the definition scope

2015-08-07 Thread Ganelin, Ilya
Simone, here are some thoughts. Please check out the "understanding closures" 
section of the Spark Programming Guide. Secondly, broadcast variables do not 
propagate updates to the underlying data. You must either create a new 
broadcast variable or, alternatively, if you simply wish to accumulate results, you 
can use an Accumulator that stores an array or queue as a buffer that you then 
read from and publish to Kafka.

You should also be able to send the results to a new DStream instead, and link 
that DStream to Kafka. Hope this gives you some ideas to play with. Thanks!
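
For illustration only, a rough sketch of the "create a new broadcast variable" idea,
refreshed once per micro-batch; `ssc`, `dstream`, fetchLatestConfig() and applyModel() are
hypothetical stand-ins:

  import org.apache.spark.broadcast.Broadcast

  var currentConfig: Broadcast[String] = ssc.sparkContext.broadcast("initial")

  val scored = dstream.transform { rdd =>
    // This block runs on the driver for every batch, so it is safe to swap the broadcast here.
    val latest = fetchLatestConfig()
    if (latest != currentConfig.value) {
      currentConfig.unpersist()
      currentConfig = rdd.sparkContext.broadcast(latest)
    }
    val captured = currentConfig              // capture the current handle in the closure
    rdd.map(record => applyModel(record, captured.value))
  }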



Thank you,
Ilya Ganelin



-Original Message-
From: simone.robutti [simone.robu...@gmail.com]
Sent: Friday, August 07, 2015 10:07 AM Eastern Standard Time
To: user@spark.apache.org
Subject: Issue when rebroadcasting a variable outside of the definition scope


Hello everyone,

this is my first message ever to a mailing list so please pardon me if for
some reason I'm violating the etiquette.

I have a problem with rebroadcasting a variable. How it should work is not
well documented, so I could find only a few simple examples to understand
how it should work.

What I'm trying to do is to propagate an update to the option for the
behaviour of my streaming transformations (in this case, the evaluation of
machine learning models). I have a listener on a kafka queue that waits for
messages and updates the broadcasted variable.

I made it work, but the system doesn't rebroadcast anything if I pass the
DStream or the broadcasted variable as a parameter.

So they must be defined both in the same scope and the rebroadcasting should
happen again in the same scope. Right now my main function looks like this:
--
 var updateVar= sc.broadcast("test")
 val stream=input.map(x => myTransformation(x,updateVar))
 stream.writeToKafka[String, String](outputProps,
(m: String) => new KeyedMessage[String,
String](configuration.outputTopic, m +updateVar.value ))

val controlStream = connector.createMessageStreamsByFilter(filterSpec, 1,
new DefaultDecoder(), new StringDecoder())(0)
for (messageAndTopic <- controlStream) {

println("ricevo")
updateVar.unpersist()
updateVar=ssc.sparkContext.broadcast(messageAndTopic.message)


}

ssc.start()
ssc.awaitTermination()

--

"updateVar" is correctly updated both in "myTransformation" and in the main
scope and I can access the updated value.

But when I try  to do this moving the logic to a class, it fails. I have
something like this (or the same queue listener from before, but moved to
another class):

class Listener(var updateVar: Broadcast[String]){...
def someFunc()={
   updateVar.unpersist()
   updateVar=sc.broadcast("new value")
}
...
}

This fails: the variable can be destroyed but cannot be updated.

Any suggestion on why there is this behaviour? Also, I would like to know how
Spark notices the reassignment to the var and starts the rebroadcasting.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-when-rebroadcasting-a-variable-outside-of-the-definition-scope-tp24172.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Eugene Morozov
Hao, 

I’d say there are few possible ways to achieve that:
1. Use KryoSerializer.
  The flaw of KryoSerializer is that current version (2.21) has an issue with 
internal state and it might not work for some objects. Spark gets its kryo 
dependency transitively through chill, so it’ll not be resolved quickly. Kryo 
doesn’t work for me (there are classes I have to transfer, but I do not have 
their codebase).

2. Wrap it into something you have control over and make that something serializable.
  The flaw is kind of obvious - it’s really hard to write serialization for 
complex objects.

3. Tricky algo: don’t do anything that might end up as reshuffle.
  That’s the way I took. The flow is that we have a CSV file as input, parse it 
and create objects that we cannot serialize / deserialize, thus cannot transfer 
over the network. Currently we’ve worked around it so that these objects are 
processed only in those partitions where they’ve been born.
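
A sketch of what that looks like in practice (NonSerializableFactory, summarize() and the
input path are placeholders, not our actual code): the objects are created and consumed
inside a single mapPartitions call, so they never leave the partition they were born in.

  val summaries = sc.textFile("hdfs:///input/data.csv").mapPartitions { lines =>
    lines.map { line =>
      val fields = line.split(",")                    // parse the CSV row
      val obj = NonSerializableFactory.build(fields)  // born here, used here, never shipped
      obj.summarize()                                 // reduce to something serializable
    }
  }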

Hope, this helps.

On 07 Aug 2015, at 12:39, Hao Ren  wrote:

> Is there any workaround to distribute non-serializable object for RDD 
> transformation or broadcast variable ?
> 
> Say I have an object of class C which is not serializable. Class C is in a 
> jar package, I have no control on it. Now I need to distribute it either by 
> rdd transformation or by broadcast. 
> 
> I tried to subclass the class C with Serializable interface. It works for 
> serialization, but deserialization does not work, since there are no 
> parameter-less constructor for the class C and deserialization is broken with 
> an invalid constructor exception.
> 
> I think it's a common use case. Any help is appreciated.
> 
> -- 
> Hao Ren
> 
> Data Engineer @ leboncoin
> 
> Paris, France

Eugene Morozov
fathers...@list.ru






Re: Spark job workflow engine recommendations

2015-08-07 Thread Hien Luu
Looks like Oozie can satisfy most of your requirements.



On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone  wrote:

> Hi,
> I'm looking for open source workflow tools/engines that allow us to
> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
> of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
> wanted to check with people here to see what they are using today.
>
> Some of the requirements of the workflow engine that I'm looking for are
>
> 1. First class support for submitting Spark jobs on Cassandra. Not some
> wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production
> scale.
> 3. Should be dead easy to write job dependencies using XML or web
> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
> C are finished. Don't need to write full blown java applications to specify
> job parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
> every hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on daily
> basis.
>
> I have looked at Ooyala's spark job server which seems to be geared towards
> making spark jobs run faster by sharing contexts between the jobs but isn't
> a full blown workflow engine per se. A combination of spark job server and
> workflow engine would be ideal
>
> Thanks for the inputs
>


Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Hi,
I'm looking for open source workflow tools/engines that allow us to
schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc, I
wanted to check with people here to see what they are using today.

Some of the requirements of the workflow engine that I'm looking for are

1. First class support for submitting Spark jobs on Cassandra. Not some
wrapper Java code to submit tasks.
2. Active open source community support and well tested at production scale.
3. Should be dead easy to write job dependencies using XML or web
interface . Ex; job A depends on Job B and Job C, so run Job A after B and
C are finished. Don't need to write full blown java applications to specify
job parameters and dependencies. Should be very simple to use.
4. Time based  recurrent scheduling. Run the spark jobs at a given time
every hour or day or week or month.
5. Job monitoring, alerting on failures and email notifications on daily
basis.

I have looked at Ooyala's spark job server which seems to be geared towards
making spark jobs run faster by sharing contexts between the jobs but isn't
a full blown workflow engine per se. A combination of spark job server and
workflow engine would be ideal

Thanks for the inputs


Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Han JU
If the object is something like a utility object (say a DB connection
handler), I often use:

   @transient lazy val someObj = MyFactory.getObj(...)

So basically `@transient` tells the closure cleaner not to serialize this,
and the `lazy val` allows it to be initialized on each executor upon its
first usage (since the class is in your jar, the executor should be able to
instantiate it).
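
A slightly fuller sketch of the same pattern; HeavyClient/HeavyClientFactory stand in for
the non-serializable class from the third-party jar, and `rdd` is assumed to exist:

  class ClientHolder(endpoint: String) extends Serializable {
    // Never serialized with the closure; re-created lazily on each executor.
    @transient lazy val client: HeavyClient = HeavyClientFactory.create(endpoint)
  }

  val holder = new ClientHolder("http://example.com")

  val processed = rdd.map { record =>
    holder.client.process(record)  // first use on an executor triggers the lazy init
  }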

2015-08-07 17:20 GMT+02:00 Philip Weaver :

> If the object cannot be serialized, then I don't think broadcast will make
> it magically serializable. You can't transfer data structures between nodes
> without serializing them somehow.
>
> On Fri, Aug 7, 2015 at 7:31 AM, Sujit Pal  wrote:
>
>> Hi Hao,
>>
>> I think sc.broadcast will allow you to broadcast non-serializable
>> objects. According to the scaladocs the Broadcast class itself is
>> Serializable and it wraps your object, allowing you to get it from the
>> Broadcast object using value().
>>
>> Not 100% sure though since I haven't tried broadcasting custom objects
>> but maybe worth trying unless you have already and failed.
>>
>> -sujit
>>
>>
>> On Fri, Aug 7, 2015 at 2:39 AM, Hao Ren  wrote:
>>
>>> Is there any workaround to distribute non-serializable object for RDD
>>> transformation or broadcast variable ?
>>>
>>> Say I have an object of class C which is not serializable. Class C is in
>>> a jar package, I have no control on it. Now I need to distribute it either
>>> by rdd transformation or by broadcast.
>>>
>>> I tried to subclass the class C with Serializable interface. It works
>>> for serialization, but deserialization does not work, since there are no
>>> parameter-less constructor for the class C and deserialization is broken
>>> with an invalid constructor exception.
>>>
>>> I think it's a common use case. Any help is appreciated.
>>>
>>> --
>>> Hao Ren
>>>
>>> Data Engineer @ leboncoin
>>>
>>> Paris, France
>>>
>>
>>
>


-- 
*JU Han*

Software Engineer @ Teads.tv

+33 061960


Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Philip Weaver
Thanks, I also confirmed that the partition discovery is slow by writing a
non-Spark application that uses the parquet library directly to load the
partitions.

It's so slow that my colleague's Python application can read the entire
contents of all the parquet data files faster than my application can even
discover the partitions!

On Fri, Aug 7, 2015 at 2:09 AM, Cheng Lian  wrote:

> However, it's weird that the partition discovery job only spawns 2 tasks.
> It should use the default parallelism, which is probably 8 according to the
> logs of the next Parquet reading job. Partition discovery is already done
> in a distributed manner via a Spark job. But the parallelism is
> mysteriously low...
>
> Cheng
>
>
> On 8/7/15 3:32 PM, Cheng Lian wrote:
>
> Hi Philip,
>
> Thanks for providing the log file. It seems that most of the time are
> spent on partition discovery. The code snippet you provided actually issues
> two jobs. The first one is for listing the input directories to find out
> all leaf directories (and this actually requires listing all leaf files,
> because we can only assert that a directory is a leaf one when it contains
> no sub-directories). Then partition information is extracted from leaf
> directory paths. This process starts at:
>
> 10:51:44 INFO sources.HadoopFsRelation: Listing leaf files and directories
> in parallel under: file:/home/pweaver/work/parquet/day=20150225, …
>
> and ends at:
>
> 10:52:31 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose
> tasks have all completed, from pool
>
> The actual tasks execution time is about 36s:
>
> 10:51:54 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0
> (TID 0, lindevspark5, PROCESS_LOCAL, 3087 bytes)
> …
> 10:52:30 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0
> (TID 0) in 36107 ms on lindevspark5 (1/2)
>
> You mentioned that your dataset has about 40,000+ partitions, so there are
> a lot of leaf directories and files out there. My guess is that the local
> file system spent lots of time listing FileStatus-es of all these files.
>
> I also noticed that Mesos job scheduling takes more time then expected. It
> is probably because this is the first Spark job executed in the
> application, and the system is not warmed up yet. For example, there’s a 6s
> gap between these two adjacent lines:
>
> 10:51:45 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
> 10:51:51 INFO mesos.CoarseMesosSchedulerBackend: Mesos task 0 is now
> TASK_RUNNING
>
> The 2nd Spark job is the real Parquet reading job, and this one actually
> finishes pretty quickly, only 3s (note that the Mesos job scheduling
> latency is also included):
>
> 10:52:32 INFO scheduler.DAGScheduler: Got job 1 (parquet at App.scala:182)
> with 8 output partitions
> …
> 10:52:32 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0
> (TID 2, lindevspark4, PROCESS_LOCAL, 2058 bytes)
> 10:52:32 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0
> (TID 3, lindevspark5, PROCESS_LOCAL, 2058 bytes)
> 10:52:32 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0
> (TID 4, lindevspark4, PROCESS_LOCAL, 2058 bytes)
> …
> 10:52:34 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 1.0
> (TID 8) in 1527 ms on lindevspark4 (6/8)
> 10:52:34 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 1.0
> (TID 6) in 1533 ms on lindevspark4 (7/8)
> 10:52:35 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 1.0
> (TID 9) in 2886 ms on lindevspark5 (8/8)
>
> That might be the reason why you observed that the C parquet library you
> mentioned (is it parquet-cpp?) is an order of magnitude faster?
>
> Cheng
>
> On 8/7/15 2:02 AM, Philip Weaver wrote:
>
> With DEBUG, the log output was over 10MB, so I opted for just INFO output.
> The (sanitized) log is attached.
>
> The driver is essentially this code:
>
> info("A")
>
> val t = System.currentTimeMillis
> val df = sqlContext.read.parquet(dir).select(...).cache
>
> val elapsed = System.currentTimeMillis - t
> info(s"Init time: ${elapsed} ms")
>
> We've also observed that it is very slow to read the contents of the
> parquet files. My colleague wrote a PySpark application that gets the list
> of files, parallelizes it, maps across it and reads each file manually
> using a C parquet library, and aggregates manually in the loop. Ignoring
> the 1-2 minute initialization cost, compared to a Spark SQL or DataFrame
> query in Scala, his is an order of magnitude faster. Since he is
> parallelizing the work through Spark, and that isn't causing any
> performance issues, it seems to be a problem with the parquet reader. I may
> try to do what he did to construct a DataFrame manually, and see if I can
> query it with Spark SQL with reasonable performance.
>
> - Philip
>
>
> On Thu, Aug 6, 2015 at 8:37 AM, Cheng Lian < 
> lian.cs@gmail.com> wrote:
>
>> Would you mind to provide the driver log?
>>
>>
>> On 8/6/15 3:58 PM, Philip Weaver wrote:
>>
>> I built 

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Philip Weaver
If the object cannot be serialized, then I don't think broadcast will make
it magically serializable. You can't transfer data structures between nodes
without serializing them somehow.
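A minimal sketch of one common workaround (the constructor argument and the process() method on C are hypothetical): wrap the non-serializable object in a small serializable holder and rebuild it lazily on each executor, so only the holder's constructor arguments ever cross the wire.

class CHolder(arg: String) extends Serializable {
  @transient lazy val c = new C(arg)             // re-created on first use in each executor JVM
}

val holder = new CHolder("config")
val result = rdd.map(x => holder.c.process(x))   // ships the holder, never C itself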

On Fri, Aug 7, 2015 at 7:31 AM, Sujit Pal  wrote:

> Hi Hao,
>
> I think sc.broadcast will allow you to broadcast non-serializable objects.
> According to the scaladocs the Broadcast class itself is Serializable and
> it wraps your object, allowing you to get it from the Broadcast object
> using value().
>
> Not 100% sure though since I haven't tried broadcasting custom objects but
> maybe worth trying unless you have already and failed.
>
> -sujit
>
>
> On Fri, Aug 7, 2015 at 2:39 AM, Hao Ren  wrote:
>
>> Is there any workaround to distribute non-serializable object for RDD
>> transformation or broadcast variable ?
>>
>> Say I have an object of class C which is not serializable. Class C is in
>> a jar package, I have no control on it. Now I need to distribute it either
>> by rdd transformation or by broadcast.
>>
>> I tried to subclass the class C with Serializable interface. It works for
>> serialization, but deserialization does not work, since there is no
>> parameter-less constructor for the class C and deserialization is broken
>> with an invalid constructor exception.
>>
>> I think it's a common use case. Any help is appreciated.
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
>>
>
>


distributing large matrices

2015-08-07 Thread iceback
Is this the sort of problem spark can accommodate?

I need to compare 10,000 matrices with each other (about 10^8 comparisons).  The
matrices are 100x10 (10^7 int values in total).  
I have 10 machines with 2 to 8 cores (8-32 "processors").
All machines have to
 - contribute to matrices generation (a simulation, takes seconds)
 - see all matrices
 - compare matrices (takes very little time compared to simulation)

I expect to persist the simulations, have spark push them to processors.
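A rough sketch of one way to lay this out (simulate() and compare() stand in for your own code, and the partition count is arbitrary): generate the matrices once across the cluster, cache them, and use cartesian for the pairwise comparisons.

val ids = sc.parallelize(0 until 10000, 200)               // one element per matrix
val matrices = ids.map(i => (i, simulate(i))).cache()      // simulate(i) builds your 100x10 matrix

val comparisons = matrices.cartesian(matrices)
  .filter { case ((i, _), (j, _)) => i < j }               // each unordered pair exactly once
  .map { case ((i, a), (j, b)) => ((i, j), compare(a, b)) }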




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distributing-large-matrices-tp24174.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: log4j.xml bundled in jar vs log4.properties in spark/conf

2015-08-07 Thread mlemay
See post for detailed explanation of you problem:
http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/log4j-xml-bundled-in-jar-vs-log4-properties-in-spark-conf-tp23923p24173.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Sujit Pal
Hi Hao,

I think sc.broadcast will allow you to broadcast non-serializable objects.
According to the scaladocs the Broadcast class itself is Serializable and
it wraps your object, allowing you to get it from the Broadcast object
using value().

Not 100% sure though since I haven't tried broadcasting custom objects but
maybe worth trying unless you have already and failed.

-sujit


On Fri, Aug 7, 2015 at 2:39 AM, Hao Ren  wrote:

> Is there any workaround to distribute non-serializable object for RDD
> transformation or broadcast variable ?
>
> Say I have an object of class C which is not serializable. Class C is in a
> jar package, I have no control on it. Now I need to distribute it either by
> rdd transformation or by broadcast.
>
> I tried to subclass the class C with Serializable interface. It works for
> serialization, but deserialization does not work, since there is no
> parameter-less constructor for the class C and deserialization is broken
> with an invalid constructor exception.
>
> I think it's a common use case. Any help is appreciated.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>


Issue when rebroadcasting a variable outside of the definition scope

2015-08-07 Thread simone.robutti
Hello everyone,

this is my first message ever to a mailing list so please pardon me if for
some reason I'm violating the etiquette.

I have a problem with rebroadcasting a variable. How it should work is not
well documented, so I could only find a few simple examples to understand
how it should work.

What I'm trying to do is propagate an update to the options that control the
behaviour of my streaming transformations (in this case, the evaluation of
machine learning models). I have a listener on a Kafka queue that waits for
messages and updates the broadcast variable.

I got it to work, but the system doesn't rebroadcast anything if I pass the
DStream or the broadcast variable as a parameter.

So both must be defined in the same scope, and the rebroadcasting must also
happen in that same scope. Right now my main function looks like this:
--
var updateVar = sc.broadcast("test")
val stream = input.map(x => myTransformation(x, updateVar))
stream.writeToKafka[String, String](outputProps,
  (m: String) => new KeyedMessage[String, String](configuration.outputTopic, m + updateVar.value))

val controlStream = connector.createMessageStreamsByFilter(filterSpec, 1,
  new DefaultDecoder(), new StringDecoder())(0)
for (messageAndTopic <- controlStream) {
  println("ricevo")
  updateVar.unpersist()
  updateVar = ssc.sparkContext.broadcast(messageAndTopic.message)
}

ssc.start()
ssc.awaitTermination()

--

"updateVar" is correctly updated both in "myTransformation" and in the main
scope and I can access the updated value.

But when I try to do this with the logic moved to a class, it fails. I have
something like this (or the same queue listener as before, but moved to
another class):

class Listener(var updateVar: Broadcast[String]) {
  // ...
  def someFunc() = {
    updateVar.unpersist()
    updateVar = sc.broadcast("new value")
  }
  // ...
}

This fails: the variable can be destroyed but cannot be updated. 

Any suggestions on why this behaviour occurs? I would also like to know how
Spark notices the reassignment to the var and starts the rebroadcasting.
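One commonly suggested pattern, sketched here against the snippet above (untested): resolve the broadcast inside transform, whose function is evaluated on the driver for every batch, so each batch's tasks pick up the latest broadcast instead of the reference captured when the DStream was defined.

var updateVar = ssc.sparkContext.broadcast("test")

val stream = input.transform { rdd =>
  val bc = updateVar                        // re-read on the driver for every batch
  rdd.map(x => myTransformation(x, bc))
}

// in the Kafka control loop, as before:
updateVar.unpersist()
updateVar = ssc.sparkContext.broadcast(messageAndTopic.message)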



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-when-rebroadcasting-a-variable-outside-of-the-definition-scope-tp24172.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
Offending commit is :

[SPARK-6014] [core] Revamp Spark shutdown hooks, fix shutdown races.
https://github.com/apache/spark/commit/e72c16e30d85cdc394d318b5551698885cfda9b8





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tp24159p24171.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
One possible solution is to spark-submit with --driver-class-path and list
all recursive dependencies.  This is fragile and error prone.

Non-working alternatives (used in SparkSubmit.scala AFTER arguments parser
is initialized):

spark-submit --packages ...
spark-submit --jars ...
spark-defaults.conf (spark.driver.extraJavaOptions, spark.jars,
spark.driver.extraClassPath,
...)

On Fri, Aug 7, 2015 at 8:57 AM, mlemay [via Apache Spark User List] <
ml-node+s1001560n24169...@n3.nabble.com> wrote:

> That starts to smell...
>
> When analyzing SparkSubmit.scala, we can see that one of the first things
> it does is parse arguments. This uses the Utils object and triggers
> initialization of member variables.  One such variable is
> ShutdownHookManager (which didn't exist in Spark 1.3), which in turn triggers the later
> log4j initialization.
>
> setContextClassLoader is called only a few steps after argument parsing, in
> submit > doRunMain > runMain.
>
> That pretty much sums it up:
> spark.util.Utils has a new static dependency on log4j that triggers its
> initialization before the call to
> setContextClassLoader(MutableURLClassLoader)
>
> Does anyone have a workaround to make this work in 1.4.1?
>
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tp24159p24169.html
> To unsubscribe from log4j custom appender ClassNotFoundException with
> spark 1.4.1, click here
> 
> .
> NAML
> 
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tp24159p24170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Estimate size of Dataframe programatically

2015-08-07 Thread Srikanth
Hello,

Is there a way to estimate the approximate size of a dataframe? I know we
can cache it and look at the size in the UI, but I'm trying to do this
programmatically. With an RDD, I can sample, sum up the size using
SizeEstimator, and then extrapolate to the entire RDD. That gives me the
approximate size of the RDD. With dataframes, it's tricky due to columnar storage.
How do we do it?

On a related note, I see the size of the RDD object to be ~60MB. Is that the
footprint of the RDD in the driver JVM?

scala> val temp = sc.parallelize(Array(1,2,3,4,5,6))
scala> SizeEstimator.estimate(temp)
res13: Long = 69507320
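For the DataFrame case, a rough sketch along the same lines (assuming an existing DataFrame df and that SizeEstimator is accessible as in the RDD example): sample some rows, measure them, and extrapolate. Note that this measures the JVM size of Row objects, not the columnar, compressed form used when the DataFrame is cached.

import org.apache.spark.util.SizeEstimator

val fraction = 0.01
val sampleRows = df.sample(withReplacement = false, fraction).collect()
val sampledBytes = sampleRows.map(row => SizeEstimator.estimate(row)).sum
val approxBytes = (sampledBytes / fraction).toLong   // rough extrapolation to the whole DataFrame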

Srikanth


Amazon DynamoDB & Spark

2015-08-07 Thread Yasemin Kaya
Hi,

Is there a way to use DynamoDB in a Spark application? I have to persist my
results to DynamoDB.
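A hedged sketch of one way to do it with the plain AWS Java SDK (the table name, attribute names, and the shape of results as an RDD of (String, String) pairs are all assumptions): open one client per partition inside foreachPartition and put the items from there.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._

results.foreachPartition { rows =>
  val client = new AmazonDynamoDBClient()   // picks up credentials from the environment / IAM role
  rows.foreach { case (id, value) =>
    val item = Map(
      "id"     -> new AttributeValue().withS(id),
      "result" -> new AttributeValue().withS(value)
    ).asJava
    client.putItem("my-results-table", item)
  }
  client.shutdown()
}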

Thanx,
yasemin

-- 
hiç ender hiç


Possible bug: JDBC with Speculative mode launches orphan queries

2015-08-07 Thread Saif.A.Ellafi
Hello,

When enabling speculation, my first job launches a partitioned JDBC
dataframe query, in which some partitions take longer than others to respond.

This causes speculation, and speculative copies of the task are launched to run the
query on other nodes. When one of those tasks finishes the query, the other one
remains connected to the database forever and never ends.
I have to go to the database management tools and interrupt the query. This
does not affect the Spark program, since the JDBC task has already ended. I
only get the log messages saying that the task execution has been ignored since
the speculated task already finished.

Is there any way to enable/disable speculation on a specific task to avoid
this? Do you have any suggestions, or how can I report this issue?
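As far as I know speculation can only be tuned application-wide, not per task or per job; a sketch of the relevant knobs (the values are only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "1000")   // ms between speculation checks
  .set("spark.speculation.quantile", "0.90")   // fraction of tasks that must finish before checking
  .set("spark.speculation.multiplier", "3")    // how much slower than the median before speculating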

Saif



Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
That starts to smell...

When analyzing SparkSubmit.scala, we can see that one of the first things it
does is parse arguments. This uses the Utils object and triggers
initialization of member variables.  One such variable is
ShutdownHookManager (which didn't exist in Spark 1.3), which in turn triggers the later
log4j initialization.

setContextClassLoader is called only a few steps after argument parsing, in
submit > doRunMain > runMain.

That pretty much sums it up:
spark.util.Utils has a new static dependency on log4j that triggers its
initialization before the call to
setContextClassLoader(MutableURLClassLoader)

Does anyone have a workaround to make this work in 1.4.1?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tp24159p24169.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
Looking at the callstack and diffs between 1.3.1 and 1.4.1-rc4, I see
something that could be relevant to the issue.

1) The callstack shows that the log4j manager gets initialized using the default Java
context class loader. This context class loader should probably be
Spark's MutableURLClassLoader, but it's not.  We can assume that
currentThread.setContextClassLoader has not been called yet.

2) Still in the callstack, we can see that ShutdownHookManager is the class
object responsible for triggering log4j initialization.

3) Looking at the diffs between 1.3 and 1.4, we can see that this
ShutdownHookManager is a new class object.

With this information, is it possible that ShutdownHookManager makes log4j
initialize too early?  By that, I mean before Spark gets the chance to set
its MutableURLClassLoader as the thread context class loader?

Let me know if it does not make sense.

Mike




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tp24159p24168.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Issues with Phoenix 4.5

2015-08-07 Thread Nicola Ferraro
Hi all,
I am getting an exception when trying to execute a Spark Job that is using
the new Phoenix 4.5 spark connector. The application works very well on my
local machine, but fails to run in a cluster environment on top of YARN.

The cluster is a Cloudera CDH 5.4.4 with HBase 1.0.0 and Phoenix 4.5
(phoenix is installed correctly as sqlline works without errors).

In the pom.xml, only the spark-core jar (version 1.3.0-cdh5.4.4) has scope
"provided", while all other jars have been copied by maven into the
/myapp/lib folder. I include all the dependent libs using the option
"--jars" in the spark-submit command (among these libraries, there is the
phoenix-core-xx.jar, which contains the class PhoenixOutputFormat).

This is the command:
spark-submit --class my.JobRunner \
--master yarn --deploy-mode client \
--jars `ls -dm /myapp/lib/* | tr -d ' \r\n'` \
/myapp/mainjar.jar
The /myapp/lib folder contains the phoenix core lib, which contains the class
org.apache.phoenix.mapreduce.PhoenixOutputFormat. But it seems that the
driver/executor cannot see it.

And I get an exception when I try to save to Phoenix an RDD:

Exception in thread "main" java.lang.RuntimeException:
java.lang.ClassNotFoundException: Class
org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2112)
at
org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:232)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:971)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:903)
at
org.apache.phoenix.spark.ProductRDDFunctions.saveToPhoenix(ProductRDDFunctions.scala:51)
at com.mypackage.save(DAOImpl.scala:41)
at com.mypackage.ProtoStreamingJob.execute(ProtoStreamingJob.scala:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.mypackage.SparkApplication.sparkRun(SparkApplication.scala:95)
at
com.mypackage.SparkApplication$delayedInit$body.apply(SparkApplication.scala:112)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.mypackage.SparkApplication.main(SparkApplication.scala:15)
at com.mypackage.ProtoStreamingJobRunner.main(ProtoStreamingJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class
org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
at
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2110)
... 30 more


The phoenix-core-xxx.jar is included in the classpath. I am sure it is in
the classpath because I tried to instantiate an object of class
PhoenixOutputFormat directly in the main class and it worked.

The problem is that the method
"org.apache.hadoop.conf.Configuration.getClassByName" cannot find it.
Since I am using "client" deploy mode, the exception should have been
thrown by the driver on the local machine.

How can this happen?


Re: Time series forecasting

2015-08-07 Thread ploffay
I'm interested in machine learning on time series.


In our environment we have a lot of metric data continuously coming from
agents. The data are stored in Cassandra. Is it possible to set up Spark so that it
would apply machine learning to previous data as well as new incoming data?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Time-series-forecasting-tp13236p24167.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkR -Graphx Connected components

2015-08-07 Thread Robineast
Hi

The graph returned by SCC (strong_graphs in your code) has vertex data where
each vertex in a component is assigned the lowest vertex id of the
component. So if you have 6 vertices (1 to 6) and 2 strongly connected
components (1 and 3, and 2,4,5 and 6) then the strongly connected components
are 1 and 2 (the lowest vertices in each component). So vertices 1 and 3
will have vertex data = 1 and vertices 2,4,5 and 6 will have vertex data 2.
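A small sketch that follows from this, reusing the graph built in the original post: group the vertex ids by that lowest-id label to list the members of each strongly connected component.

val scc = graph.stronglyConnectedComponents(10)

val components = scc.vertices
  .map { case (vertexId, componentId) => (componentId, vertexId) }
  .groupByKey()

components.collect().foreach { case (componentId, members) =>
  println(s"component $componentId: ${members.mkString(", ")}")
}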

Robin
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/malak/   



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Graphx-Connected-components-tp24165p24166.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Insert operation in Dataframe

2015-08-07 Thread guoqing0...@yahoo.com.hk
Hi all,
Does the DataFrame API support the insert operation, like sqlContext.sql("insert
into table1 xxx select xxx from table2")?
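A short sketch of what should work (assuming Spark 1.4's DataFrameWriter and a HiveContext for the SQL form; table and column names are taken from the example):

val df = sqlContext.table("table2").select("xxx")
df.write.insertInto("table1")

// or staying in SQL:
sqlContext.sql("INSERT INTO TABLE table1 SELECT xxx FROM table2")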



guoqing0...@yahoo.com.hk


Re: How to binarize data in spark

2015-08-07 Thread Adamantios Corais
I have ended up with the following piece of code, but it turns out to be
really slow... Any other ideas, given that I can only use MLlib 1.2?

val data = test11.map(x => ((x(0), x(1)), x(2)))
  .groupByKey()
  .map(x => (x._1, x._2.toArray))
  .map { x =>
    val lt: Array[Double] = new Array[Double](test12.size)
    val id = x._1._1
    val cl = x._1._2
    val dt = x._2
    var i = -1
    test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
    val vs = Vectors.dense(lt)
    (id, cl, vs)
  }
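A hedged sketch of a sparser variant (assuming, as above, that test12 is a driver-side collection of all products and test11 yields (user, class, product) rows): index the products once, broadcast the index, and build sparse vectors directly instead of scanning every product for every user.

import org.apache.spark.mllib.linalg.Vectors

val productIndex = test12.zipWithIndex.toMap               // product -> column index
val bIndex = sc.broadcast(productIndex)
val numProducts = productIndex.size

val sparseData = test11.map(x => ((x(0), x(1)), x(2)))
  .groupByKey()
  .map { case ((id, cl), products) =>
    val indices = products.map(bIndex.value).toArray.distinct.sorted
    val values = Array.fill(indices.length)(1.0)
    (id, cl, Vectors.sparse(numProducts, indices, values))
  }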



*// Adamantios*



On Fri, Aug 7, 2015 at 8:36 AM, Yanbo Liang  wrote:

> I think you want to flatten the 1M products into a vector of 1M elements, most
> of which are of course zero.
> It looks like HashingTF
> 
> can help you.
>
> 2015-08-07 11:02 GMT+08:00 praveen S :
>
>> Use StringIndexer in MLib1.4 :
>>
>> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
>>
>> On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais <
>> adamantios.cor...@gmail.com> wrote:
>>
>>> I have a set of data based on which I want to create a classification
>>> model. Each row has the following form:
>>>
>>> user1,class1,product1
 user1,class1,product2
 user1,class1,product5
 user2,class1,product2
 user2,class1,product5
 user3,class2,product1
 etc
>>>
>>>
>>> There are about 1M users, 2 classes, and 1M products. What I would like
>>> to do next is create the sparse vectors (something already supported by
>>> MLlib) BUT in order to apply that function I have to create the dense 
>>> vectors
>>> (with the 0s), first. In other words, I have to binarize my data. What's
>>> the easiest (or most elegant) way of doing that?
>>>
>>>
>>> *// Adamantios*
>>>
>>>
>>>
>>
>


RE: Specifying the role when launching an AWS spark cluster using spark_ec2

2015-08-07 Thread Ewan Leith
You'll have a lot less hassle using the AWS EMR instances with Spark 1.4.1 for 
now, until the spark_ec2.py scripts move to Hadoop 2.7.1; at the moment I'm 
pretty sure they only use Hadoop 2.4.

The EMR setup with Spark lets you use s3:// URIs with IAM roles

Ewan

-Original Message-
From: SK [mailto:skrishna...@gmail.com] 
Sent: 06 August 2015 18:27
To: user@spark.apache.org
Subject: Specifying the role when launching an AWS spark cluster using spark_ec2

Hi,

I need to access data on S3 from another account and I have been given the IAM 
role information to access that S3 bucket. From what I understand, AWS allows 
us to attach a role to a resource at the time it is created. However, I don't 
see an option for specifying the role using the spark_ec2.py script. 
So I created a spark cluster using the default role, but I was not able to 
change its IAM role after creation through AWS console.

I see a ticket for this issue:
https://github.com/apache/spark/pull/6962 and the status is closed. 

If anyone knows how I can specify the role using spark_ec2.py, please let me 
know. I am using spark 1.4.1.

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Specifying-the-role-when-launching-an-AWS-spark-cluster-using-spark-ec2-tp24154.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark streaming and session windows

2015-08-07 Thread Ankur Chauhan
Hi all,

I am trying to figure out how to perform equivalent of "Session windows" (as 
mentioned in https://cloud.google.com/dataflow/model/windowing) using spark 
streaming. Is it even possible (i.e. possible to do efficiently at scale). Just 
to expand on the definition:

Taken from the google dataflow documentation:

The simplest kind of session windowing specifies a minimum gap duration. All 
data arriving below a minimum threshold of time delay is grouped into the same 
window. If data arrives after the minimum specified gap duration time, this 
initiates the start of a new window.




Any help would be appreciated.

-- Ankur Chauhan


signature.asc
Description: Message signed with OpenPGP using GPGMail


automatically determine cluster number

2015-08-07 Thread Ziqi Zhang

Hi
I am new to Spark and I need to use the clustering functionality to 
process a large dataset.


There are between 50k and 1mil objects to cluster. However, the problem 
is that the optimal number of clusters is unknown. We cannot even 
estimate a range, except that we know there are N objects.


Previously, on small datasets, I was using R and an R package based on the 
Calinski-Harabasz index to automatically determine the cluster number. But with 
that amount of data R simply breaks.


So I wonder if Spark has implemented any algorithms to automatically 
determine the cluster number?


Many thanks!!
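As far as I know there is no built-in index such as Calinski-Harabasz in MLlib; a rough sketch of the usual fallback (the input path and the range of k are made up): sweep k with KMeans and look at the within-cluster cost for an elbow.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///path/to/features.csv")     // hypothetical input
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val costs = (2 to 40 by 2).map { k =>
  val model = KMeans.train(data, k, 20)                    // 20 iterations
  (k, model.computeCost(data))                             // within-set sum of squared errors
}
costs.foreach { case (k, wsse) => println(s"k=$k  WSSE=$wsse") }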

--
Ziqi Zhang
Research Associate
Department of Computer Science
University of Sheffield


---
This email has been checked for viruses by Avast antivirus software.
http://www.avast.com


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread Peter Rudenko

No, here's an example:

COL1  COL2
a     one
b     two
a     two
c     three


StringIndexer.setInputCol(COL1).setOutputCol(SI1) ->

(0 -> a, 1 -> b, 2 -> c)
SI1
0
1
0
2

StringIndexer.setInputCol(COL2).setOutputCol(SI2) ->
(0 -> one, 1 -> two, 2 -> three)
SI2
0
1
1
2

VectorAssembler.setInputCols(SI1, SI2).setOutputCol(features) ->
features
[0.0, 0.0]
[1.0, 1.0]
[0.0, 1.0]
[2.0, 2.0]


HashingTF.setNumFeatures(2).setInputCol(COL1).setOutputCol(HT1)

bucket1  bucket2
a, a, b  c

HT1
3   // hash collision
3
3
1
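A runnable sketch of the same example with the spark.ml API (assuming Spark 1.4 and the usual sqlContext):

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val df = sqlContext.createDataFrame(Seq(
  ("a", "one"), ("b", "two"), ("a", "two"), ("c", "three")
)).toDF("COL1", "COL2")

val si1 = new StringIndexer().setInputCol("COL1").setOutputCol("SI1").fit(df)
val si2 = new StringIndexer().setInputCol("COL2").setOutputCol("SI2").fit(df)
val indexed = si2.transform(si1.transform(df))

val assembled = new VectorAssembler()
  .setInputCols(Array("SI1", "SI2"))
  .setOutputCol("features")
  .transform(indexed)

assembled.show()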

Thanks,
Peter Rudenko
On 2015-08-07 09:55, praveen S wrote:


Is StringIndexer + VectorAssembler equivalent to HashingTF while 
converting the document for analysis?





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: DataFrame column structure change

2015-08-07 Thread Rishabh Bhardwaj
I am doing it by creating a new data frame out of the fields to be nested
and then joining it with the original DF.
I am looking for a more optimized solution here.
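One approach that avoids the join, sketched under the assumption of Spark 1.4+ (where org.apache.spark.sql.functions.struct is available):

import org.apache.spark.sql.functions.struct

val newDF = df.select(
  df("a"), df("b"), df("c"),
  struct(df("d"), df("e")).as("newCol")
)
newDF.printSchema()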

On Fri, Aug 7, 2015 at 2:06 PM, Rishabh Bhardwaj  wrote:

> Hi all,
>
> I want to create some nested structure from the existing columns of
> the dataframe.
> For that, I am trying to transform the DF in the following way, but couldn't
> do it.
>
> scala> df.printSchema
> root
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: string (nullable = true)
>  |-- e: string (nullable = true)
>  |-- f: string (nullable = true)
>
> *To*
>
> scala> newDF.printSchema
> root
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
>  |-- newCol: struct (nullable = true)
>  ||-- d: string (nullable = true)
>  ||-- e: string (nullable = true)
>
>
> help me.
>
> Regards,
> Rishabh.
>


How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Hao Ren
Is there any workaround to distribute non-serializable object for RDD
transformation or broadcast variable ?

Say I have an object of class C which is not serializable. Class C is in a
jar package, I have no control on it. Now I need to distribute it either by
rdd transformation or by broadcast.

I tried to subclass the class C with Serializable interface. It works for
serialization, but deserialization does not work, since there is no
parameter-less constructor for the class C and deserialization is broken
with an invalid constructor exception.

I think it's a common use case. Any help is appreciated.

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France


SparkR -Graphx Connected components

2015-08-07 Thread smagadi
Hi, I was trying to use stronglyConnectedComponents().
Given a DAG as the graph, I was supposed to get back the list of strongly connected
components.

def main(args: Array[String])
  {
   val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )
val sc = new SparkContext("local", "readLoCSH", "127.0.0.1")
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
  
 val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
  
 val strong_graphs: Graph[VertexId, Int] =
   graph.stronglyConnectedComponents(10)


Help needed in completing the code. I do not know how to get the
strongly connected nodes from here. Please help me complete this code.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Graphx-Connected-components-tp24165.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


