Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-01 Thread Vadim Semenov
Oh, and try running even smaller executors, i.e. with
`spark.executor.memory` <= 16GiB. I wonder what result you're going to get.
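
Something along these lines, just as a sketch (the exact values are illustrative,
tune them to your box):

import org.apache.spark.sql.SparkSession

// Illustrative only: smaller executors keep any single task's sort pages
// well under the ~16 GiB page-size limit in TaskMemoryManager.
val spark = SparkSession.builder()
  .appName("smaller-executors-sketch")
  .config("spark.executor.memory", "16g")
  .config("spark.executor.cores", "2")
  .getOrCreate()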

On Sun, Oct 2, 2016 at 1:24 AM, Vadim Semenov 
wrote:

> > Do you mean running a multi-JVM 'cluster' on the single machine?
> Yes, that's what I suggested.
>
> You can get some information here:
> http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
>
> > How would that affect performance/memory-consumption? If a multi-JVM
> setup can handle such a large input, then why can't a single-JVM break down
> the job into smaller tasks?
> I don't have an answer to these questions; it requires an understanding of
> Spark, the JVM, and the internals of your setup.
>
> I ran into the same issue only once, when I tried to read a gzipped file
> whose size was >16GiB. That's the only time I hit this:
> https://github.com/apache/spark/blob/5d84c7fd83502aeb551d46a740502db4862508fe/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L238-L243
> In the end I had to recompress the file into bzip2, which is splittable, so that I
> could read it with Spark.
>
>
> I'd look into the size of your files, and if they're huge I'd try to connect
> the error you got to the file sizes (though it's strange to me, as the
> block size of a Parquet file is 128MiB). I don't have any other
> suggestions, I'm sorry.
>
>
> On Sat, Oct 1, 2016 at 11:35 PM, Babak Alipour 
> wrote:
>
>> Do you mean running a multi-JVM 'cluster' on the single machine? How
>> would that affect performance/memory-consumption? If a multi-JVM setup
>> can handle such a large input, then why can't a single-JVM break down the
>> job into smaller tasks?
>>
>> I also found that SPARK-9411 mentions making the page_size configurable
>> but it's hard-limited to ((1L << 31) - 1) * 8L [1]
>>
>> [1] https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
>>
>> ​Spark-9452 also talks about larger page sizes but I don't know how that
>> affects my use case.​ [2]
>>
>> [2] https://github.com/apache/spark/pull/7891
>>
>>
>> ​The reason provided here is that the on-heap allocator's maximum page
>> size is limited by the maximum amount of data that can be stored in a
>> long[]​.
>> Is it possible to force this specific operation to go off-heap so that it
>> can possibly use a bigger page size?
>>
>>
>>
>> ​>Babak​
>>
>>
>> *Babak Alipour ,*
>> *University of Florida*
>>
>> On Fri, Sep 30, 2016 at 3:03 PM, Vadim Semenov <
>> vadim.seme...@datadoghq.com> wrote:
>>
>>> Run more smaller executors: change `spark.executor.memory` to 32g and
>>> `spark.executor.cores` to 2-4, for example.
>>>
>>> Changing driver's memory won't help because it doesn't participate in
>>> execution.
>>>
>>> On Fri, Sep 30, 2016 at 2:58 PM, Babak Alipour 
>>> wrote:
>>>
 Thank you for your replies.

 @Mich, using LIMIT 100 in the query prevents the exception but given
 the fact that there's enough memory, I don't think this should happen even
 without LIMIT.

 @Vadim, here's the full stack trace:

 Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
 with more than 17179869176 bytes
 at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskM
 emoryManager.java:241)
 at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryCo
 nsumer.java:121)
 at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
 orter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
 at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
 orter.insertRecord(UnsafeExternalSorter.java:396)
 at org.apache.spark.sql.execution.UnsafeExternalRowSorter.inser
 tRow(UnsafeExternalRowSorter.java:94)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
 eratedIterator.sort_addToSorter$(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
 eratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
 eratedIterator.processNext(Unknown Source)
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(B
 ufferedRowIterator.java:43)
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfu
 n$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:40
 8)
 at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.w
 rite(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
 Task.scala:79)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
 Task.scala:47)
 at 

Re: get different results when debugging and running scala program

2016-10-01 Thread Vadim Semenov
The question has no connection to Spark.

In the future, when you use Apache mailing lists, use external services to add
screenshots, and make sure that your code is formatted so that other members are
able to read it.

On Fri, Sep 30, 2016 at 11:25 AM, chen yong  wrote:

> Hello All,
>
>
>
> I am using IDEA 15.0.4 to debug a Scala program. It is strange to me that
> the results are different when I debug the program versus when I run it. The
> differences can be seen in the attached files run.jpg and debug.jpg. The code
> lines of the Scala program are shown below.
>
>
> Thank you all
>
>
> ---
>
> import scala.collection.mutable.ArrayBuffer
>
> object TestCase1{
> def func(test:Iterator[(Int,Long)]): Iterator[(Int,Long)]={
> println("in")
> val test1=test.flatMap{
> case(item,count)=>
> val newPrefix=item
> println(count)
> val a=Iterator.single((newPrefix,count))
> func(a)
> val c = a
> c
> }
> test1
> }
> def main(args: Array[String]){
> val freqItems = ArrayBuffer((2,3L),(3,2L),(4,1L))
> val test = freqItems.toIterator
> val result = func(test)
> val reer = result.toArray
> }
> }
>
>
>
>


Re: Spark on yarn environment var

2016-10-01 Thread Vadim Semenov
The question should be addressed to the Oozie community.

As far as I remember, a Spark action doesn't have support for env variables.

On Fri, Sep 30, 2016 at 8:11 PM, Saurabh Malviya (samalviy) <
samal...@cisco.com> wrote:

> Hi,
>
>
>
> I am running Spark on YARN using Oozie.
>
>
>
> When I submit through the command line using spark-submit, Spark is able to read
> the env variable. But when I submit through Oozie, it is not able to get the env
> variable, and I don't see the driver log.
>
>
>
> Is there any way to specify an env variable in an Oozie Spark action?
>
>
>
> Saurabh
>


Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-01 Thread Vadim Semenov
> Do you mean running a multi-JVM 'cluster' on the single machine?
Yes, that's what I suggested.

You can get some information here:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

> How would that affect performance/memory-consumption? If a multi-JVM
setup can handle such a large input, then why can't a single-JVM break down
the job into smaller tasks?
I don't have an answer to these questions; it requires an understanding of
Spark, the JVM, and the internals of your setup.

I ran into the same issue only once, when I tried to read a gzipped file
whose size was >16GiB. That's the only time I hit this:
https://github.com/apache/spark/blob/5d84c7fd83502aeb551d46a740502db4862508fe/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L238-L243
In the end I had to recompress the file into bzip2, which is splittable, so that I
could read it with Spark.
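
A quick way to see the difference in a spark-shell (the paths here are just
placeholders):

// A single .gz file is not splittable, so it arrives as one partition;
// the same data recompressed as .bz2 can be split across many partitions.
val gz  = sc.textFile("hdfs:///data/big-file.gz")
val bz2 = sc.textFile("hdfs:///data/big-file.bz2")
println(gz.getNumPartitions)   // 1
println(bz2.getNumPartitions)  // > 1, driven by the input split size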


I'd look into the size of your files, and if they're huge I'd try to connect the
error you got to the file sizes (though it's strange to me, as the block
size of a Parquet file is 128MiB). I don't have any other suggestions, I'm
sorry.


On Sat, Oct 1, 2016 at 11:35 PM, Babak Alipour 
wrote:

> Do you mean running a multi-JVM 'cluster' on the single machine? How would
> that affect performance/memory-consumption? If a multi-JVM setup can
> handle such a large input, then why can't a single-JVM break down the job
> into smaller tasks?
>
> I also found that SPARK-9411 mentions making the page_size configurable
> but it's hard-limited to ((1L << 31) - 1) * 8L [1]
>
> [1] https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
>
> ​Spark-9452 also talks about larger page sizes but I don't know how that
> affects my use case.​ [2]
>
> [2] https://github.com/apache/spark/pull/7891
>
>
> ​The reason provided here is that the on-heap allocator's maximum page
> size is limited by the maximum amount of data that can be stored in a
> long[]​.
> Is it possible to force this specific operation to go off-heap so that it
> can possibly use a bigger page size?
>
>
>
> ​>Babak​
>
>
> *Babak Alipour ,*
> *University of Florida*
>
> On Fri, Sep 30, 2016 at 3:03 PM, Vadim Semenov <
> vadim.seme...@datadoghq.com> wrote:
>
>> Run more smaller executors: change `spark.executor.memory` to 32g and
>> `spark.executor.cores` to 2-4, for example.
>>
>> Changing driver's memory won't help because it doesn't participate in
>> execution.
>>
>> On Fri, Sep 30, 2016 at 2:58 PM, Babak Alipour 
>> wrote:
>>
>>> Thank you for your replies.
>>>
>>> @Mich, using LIMIT 100 in the query prevents the exception but given the
>>> fact that there's enough memory, I don't think this should happen even
>>> without LIMIT.
>>>
>>> @Vadim, here's the full stack trace:
>>>
>>> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
>>> with more than 17179869176 bytes
>>> at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskM
>>> emoryManager.java:241)
>>> at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryCo
>>> nsumer.java:121)
>>> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
>>> orter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
>>> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
>>> orter.insertRecord(UnsafeExternalSorter.java:396)
>>> at org.apache.spark.sql.execution.UnsafeExternalRowSorter.inser
>>> tRow(UnsafeExternalRowSorter.java:94)
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>>> eratedIterator.sort_addToSorter$(Unknown Source)
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>>> eratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>>> eratedIterator.processNext(Unknown Source)
>>> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(B
>>> ufferedRowIterator.java:43)
>>> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfu
>>> n$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:40
>>> 8)
>>> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.w
>>> rite(BypassMergeSortShuffleWriter.java:125)
>>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>>> Task.scala:79)
>>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>>> Task.scala:47)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.s
>>> cala:274)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>>> Executor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>>> lExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>

use CrossValidatorModel for prediction

2016-10-01 Thread Pengcheng
Dear Spark Users,

I was wondering.

I have a trained cross-validator model:
*model: CrossValidatorModel*

I want to predict a score for *features: RDD[Features]*.

Right now I have to convert features to a DataFrame and then perform
predictions as follows:

"""
val sqlContext = new SQLContext(features.context)
val input: DataFrame = sqlContext.createDataFrame(features.map(x =>
(Vectors.dense(x.getArray),1.0) )).toDF("features", "label")
model.transform(input)
"""

*I wonder if there is any API I can use to perform prediction on each
individual Features object.*

*For example, features.map( x => model.predict(x) ) *
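
In case it helps frame the question, the closest I can get today is to go
through transform and pull the column back out (just a sketch; Features and
getArray are my own types, and the prediction column type depends on the
concrete model):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

// Wrap the RDD in a DataFrame, score it, then unwrap the predictions again.
val sqlContext = new SQLContext(features.context)
val input = sqlContext
  .createDataFrame(features.map(x => (Vectors.dense(x.getArray), 1.0)))
  .toDF("features", "label")
val scores = model.transform(input)
  .select("prediction")
  .rdd
  .map(_.getDouble(0))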



a big thank you!

peter


Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-01 Thread Babak Alipour
To add one more note, I tried running more, smaller executors, each with
32-64g memory and executor.cores set to 2-4 (with 2 workers as well), and I'm
still getting the same exception:

java.lang.IllegalArgumentException: Cannot allocate a page with more than
17179869176 bytes
at
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
at
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:343)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:393)
at
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

>Babak

*Babak Alipour ,*
*University of Florida*

On Sat, Oct 1, 2016 at 11:35 PM, Babak Alipour 
wrote:

> Do you mean running a multi-JVM 'cluster' on the single machine? How would
> that affect performance/memory-consumption? If a multi-JVM setup can
> handle such a large input, then why can't a single-JVM break down the job
> into smaller tasks?
>
> I also found that SPARK-9411 mentions making the page_size configurable
> but it's hard-limited to ((1L << 31) - 1) * 8L [1]
>
> [1] https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
>
> ​Spark-9452 also talks about larger page sizes but I don't know how that
> affects my use case.​ [2]
>
> [2] https://github.com/apache/spark/pull/7891
>
>
> ​The reason provided here is that the on-heap allocator's maximum page
> size is limited by the maximum amount of data that can be stored in a
> long[]​.
> Is it possible to force this specific operation to go off-heap so that it
> can possibly use a bigger page size?
>
>
>
> ​>Babak​
>
>
> *Babak Alipour ,*
> *University of Florida*
>
> On Fri, Sep 30, 2016 at 3:03 PM, Vadim Semenov <
> vadim.seme...@datadoghq.com> wrote:
>
>> Run more smaller executors: change `spark.executor.memory` to 32g and
>> `spark.executor.cores` to 2-4, for example.
>>
>> Changing driver's memory won't help because it doesn't participate in
>> execution.
>>
>> On Fri, Sep 30, 2016 at 2:58 PM, Babak Alipour 
>> wrote:
>>
>>> Thank you for your replies.
>>>
>>> @Mich, using LIMIT 100 in the query prevents the exception but given the
>>> fact that there's enough memory, I don't think this should happen even
>>> without LIMIT.
>>>
>>> @Vadim, here's the full stack trace:
>>>
>>> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
>>> with more than 17179869176 bytes
>>> at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskM
>>> emoryManager.java:241)
>>> at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryCo
>>> nsumer.java:121)
>>> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
>>> orter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
>>> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
>>> orter.insertRecord(UnsafeExternalSorter.java:396)
>>> at org.apache.spark.sql.execution.UnsafeExternalRowSorter.inser
>>> tRow(UnsafeExternalRowSorter.java:94)
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>>> eratedIterator.sort_addToSorter$(Unknown Source)
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>>> eratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>>> eratedIterator.processNext(Unknown 

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-01 Thread Babak Alipour
Do you mean running a multi-JVM 'cluster' on the single machine? How would
that affect performance/memory-consumption? If a multi-JVM setup can handle
such a large input, then why can't a single-JVM break down the job into
smaller tasks?

I also found that SPARK-9411 mentions making the page_size configurable but
it's hard-limited to ((1L << 31) - 1) * 8L [1]

[1]
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
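
(For what it's worth, that hard limit works out to exactly the number in the
error message:)

scala> ((1L << 31) - 1) * 8L
res0: Long = 17179869176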

SPARK-9452 also talks about larger page sizes but I don't know how that
affects my use case. [2]

[2] https://github.com/apache/spark/pull/7891


The reason provided here is that the on-heap allocator's maximum page size
is limited by the maximum amount of data that can be stored in a long[].
Is it possible to force this specific operation to go off-heap so that it
can possibly use a bigger page size?



>Babak


*Babak Alipour ,*
*University of Florida*

On Fri, Sep 30, 2016 at 3:03 PM, Vadim Semenov 
wrote:

> Run more smaller executors: change `spark.executor.memory` to 32g and
> `spark.executor.cores` to 2-4, for example.
>
> Changing driver's memory won't help because it doesn't participate in
> execution.
>
> On Fri, Sep 30, 2016 at 2:58 PM, Babak Alipour 
> wrote:
>
>> Thank you for your replies.
>>
>> @Mich, using LIMIT 100 in the query prevents the exception but given the
>> fact that there's enough memory, I don't think this should happen even
>> without LIMIT.
>>
>> @Vadim, here's the full stack trace:
>>
>> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
>> with more than 17179869176 bytes
>> at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskM
>> emoryManager.java:241)
>> at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryCo
>> nsumer.java:121)
>> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
>> orter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
>> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
>> orter.insertRecord(UnsafeExternalSorter.java:396)
>> at org.apache.spark.sql.execution.UnsafeExternalRowSorter.inser
>> tRow(UnsafeExternalRowSorter.java:94)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>> eratedIterator.sort_addToSorter$(Unknown Source)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>> eratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
>> eratedIterator.processNext(Unknown Source)
>> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(B
>> ufferedRowIterator.java:43)
>> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfu
>> n$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.w
>> rite(BypassMergeSortShuffleWriter.java:125)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>> Task.scala:79)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>> Task.scala:47)
>> at org.apache.spark.scheduler.Task.run(Task.scala:85)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.s
>> cala:274)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> I'm running Spark in local mode, so there is only one executor (the driver),
>> and spark.driver.memory is set to 64g. Changing the driver's memory doesn't
>> help.
>>
>> *Babak Alipour ,*
>> *University of Florida*
>>
>> On Fri, Sep 30, 2016 at 2:05 PM, Vadim Semenov <
>> vadim.seme...@datadoghq.com> wrote:
>>
>>> Can you post the whole exception stack trace?
>>> What are your executor memory settings?
>>>
>>> Right now I assume that it happens in UnsafeExternalRowSorter ->
>>> UnsafeExternalSorter:insertRecord
>>>
>>> Running more executors with lower `spark.executor.memory` should help.
>>>
>>>
>>> On Fri, Sep 30, 2016 at 12:57 PM, Babak Alipour >> > wrote:
>>>
 Greetings everyone,

 I'm trying to read a single field of a Hive table stored as Parquet in
 Spark (~140GB for the entire table, this single field should be just a few
 GB) and look at the sorted output using the following:

 sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")

 ​But this simple line of code gives:

 Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
 with more than 17179869176 bytes

 Same error for:

 sql("SELECT " + field + " FROM MY_TABLE).sort(field)

 and:

 sql("SELECT " + field + 

Re: Restful WS for Spark

2016-10-01 Thread Vadim Semenov
I worked with both, so I'll give you some insight from my perspective.

spark-jobserver has a stable API and is overall mature, but it doesn't work with
yarn-cluster mode, and Python support is in development right now.

Livy has a stable API (I'm not sure I can speak for it since it appeared
recently, but considering that Cloudera is behind it, I'd say it's mature),
supports all deployment modes, and has support for Python & R.

spark-jobserver has a nicer UI, better logging features overall, and a
richer API.

I had a harder time adding new features to spark-jobserver than to Livy, so
if that's something you'll need to do, you should take it into account.

In some cases, spark-jobserver runs into issues that are difficult to debug
because of Akka.

spark-jobserver is better if you use Scala and need to work with shared Spark
contexts, as it has a built-in API for shared RDDs and objects.

Neither supports high availability, but Livy has a few open, active PRs for it.

I'd start with spark-jobserver as it's easier.
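
For a feel of the programming model, a spark-jobserver job looks roughly like
this (a sketch from memory of their word-count example; double-check the trait
and method names against the project's README):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

// Counts words from a config parameter; the job is packaged as a jar,
// uploaded to the server, and invoked over REST.
object WordCountJob extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq)
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
}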

On Sat, Oct 1, 2016 at 8:46 AM, ABHISHEK  wrote:

> Thanks Vadim.
> I looked at Spark Job Server but am not sure about session management; will
> the job run in the Hadoop cluster?
> How stable is this API? We will need to implement it in a production env.
> Livy looks more promising but is still not mature.
> Have you tested either of them?
>
> Thanks,
> Abhishek
> Abhishek
>
>
> On Fri, Sep 30, 2016 at 11:39 PM, Vadim Semenov <
> vadim.seme...@datadoghq.com> wrote:
>
>> There're two REST job servers that work with spark:
>>
>> https://github.com/spark-jobserver/spark-jobserver
>>
>> https://github.com/cloudera/livy
>>
>>
>> On Fri, Sep 30, 2016 at 2:07 PM, ABHISHEK  wrote:
>>
>>> Hello all,
>>> Have you tried accessing a Spark application using RESTful web services?
>>>
>>> I have a requirement where a remote user submits a request with some data;
>>> it should be sent to Spark, and the job should run in Hadoop cluster mode.
>>> The output should be sent back to the user.
>>>
>>> Please share your  expertise.
>>> Thanks,
>>> Abhishek
>>>
>>
>>
>


Re: Broadcast big dataset

2016-10-01 Thread Anastasios Zouzias
Hey,

Is the driver running out of memory? Try 8g for the driver memory. Speaking of
which, how do you estimate that your broadcast dataset is 500M?
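
One way to get a rough number is Spark's own estimator (a sketch; `myLocalData`
is a placeholder for whatever object you pass to sc.broadcast):

import org.apache.spark.util.SizeEstimator

// Approximate in-memory footprint, in bytes, of the object to be broadcast.
println(SizeEstimator.estimate(myLocalData))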

Best,
Anastasios

On 29.09.2016 5:32 AM, "WangJianfei"  wrote:

> First, thank you very much!
> My executor memory is also 4G, but my Spark version is 1.5. Does the Spark
> version cause trouble?
>
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Broadcast-big-
> dataset-tp19127p19143.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

2016-10-01 Thread Benjamin Kim
Mich,

I know up until CDH 5.4 we had to add the HTrace jar to the classpath to make 
it work using the command below. But after upgrading to CDH 5.7, it became 
unnecessary.

echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> 
/etc/spark/conf/classpath.txt

Hope this helps.

Cheers,
Ben


> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh  wrote:
> 
> Trying a bulk load using HFiles in Spark, as in the example below:
> 
> import org.apache.spark._
> import org.apache.spark.rdd.NewHadoopRDD
> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
> import org.apache.hadoop.hbase.client.HBaseAdmin
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HColumnDescriptor
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.mapred.TableOutputFormat
> import org.apache.hadoop.mapred.JobConf
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
> import org.apache.hadoop.mapreduce.Job
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
> import org.apache.hadoop.hbase.KeyValue
> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
> 
> So far no issues.
> 
> Then I do
> 
> val conf = HBaseConfiguration.create()
> conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hbase-default.xml, hbase-site.xml
> val tableName = "testTable"
> tableName: String = testTable
> 
> But this one fails:
> 
> scala> val table = new HTable(conf, tableName)
> java.io.IOException: java.lang.reflect.InvocationTargetException
>   at 
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:431)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:424)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:302)
>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185)
>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151)
>   ... 52 elided
> Caused by: java.lang.reflect.InvocationTargetException: 
> java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
>   ... 57 more
> Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>   at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:216)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:419)
>   at 
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>   at 
> org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:905)
>   at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:648)
>   ... 62 more
> Caused by: java.lang.ClassNotFoundException: org.apache.htrace.Trace
> 
> I have got all the jar files in spark-defaults.conf
> 
> spark.driver.extraClassPath  
> /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
> spark.executor.extraClassPath
> /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
> 
> 
> and also in Spark shell where I test the code
> 
>  --jars 
> /home/hduser/jars/hbase-client-1.2.3.jar,/home/hduser/jars/hbase-server-1.2.3.jar,/home/hduser/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-protocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/home/hduser/jars/hive-hbase-handler-2.1.0.jar'
> 
> So any ideas will be appreciated.
> 
> 

Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

2016-10-01 Thread Mich Talebzadeh
Trying a bulk load using HFiles in Spark, as in the example below:

import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

So far no issues.

Then I do

val conf = HBaseConfiguration.create()
conf: org.apache.hadoop.conf.Configuration = Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
val tableName = "testTable"
tableName: String = testTable

But this one fails:

scala> val table = new HTable(conf, tableName)
java.io.IOException: java.lang.reflect.InvocationTargetException
  at
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
  at
org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:431)
  at
org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:424)
  at
org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:302)
  at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185)
  at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151)
  ... 52 elided
Caused by: java.lang.reflect.InvocationTargetException:
java.lang.NoClassDefFoundError: org/apache/htrace/Trace
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
  ... 57 more
Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
  at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:216)
  at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:419)
  at
org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
  at
org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
  at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:905)
  at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:648)
  ... 62 more
Caused by: java.lang.ClassNotFoundException: org.apache.htrace.Trace

I have got all the jar files in spark-defaults.conf

spark.driver.extraClassPath
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
spark.executor.extraClassPath
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar


and also in Spark shell where I test the code

 --jars
/home/hduser/jars/hbase-client-1.2.3.jar,/home/hduser/jars/hbase-server-1.2.3.jar,/home/hduser/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-protocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/home/hduser/jars/hive-hbase-handler-2.1.0.jar'

So any ideas will be appreciated.

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Deep learning libraries for scala

2016-10-01 Thread janardhan shetty
Apparently there are no neural network implementations in TensorFrames
that we can use, right? Or am I missing something here?

I would like to apply neural networks in an NLP setting; are there any
implementations that can be looked into?

On Fri, Sep 30, 2016 at 8:14 PM, Suresh Thalamati <
suresh.thalam...@gmail.com> wrote:

> TensorFrames
>
> https://spark-packages.org/package/databricks/tensorframes
>
> Hope that helps
> -suresh
>
> On Sep 30, 2016, at 8:00 PM, janardhan shetty 
> wrote:
>
> Looking for scala dataframes in particular ?
>
> On Fri, Sep 30, 2016 at 7:46 PM, Gavin Yue  wrote:
>
>> Skymind you could try. It is Java.
>>
>> I never tested it, though.
>>
>> > On Sep 30, 2016, at 7:30 PM, janardhan shetty 
>> wrote:
>> >
>> > Hi,
>> >
>> > Are there any good libraries which can be used for Scala deep learning
>> > models?
>> > How can we integrate TensorFlow with Scala ML?
>>
>
>
>


Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-10-01 Thread Sean Owen
"Compile failed via zinc server"

Try shutting down zinc. Something's funny about your compile server.
It's not required anyway.

On Sat, Oct 1, 2016 at 3:24 PM, Marco Mistroni  wrote:
> Hi guys
>  sorry to annoy you with this, but I am getting nowhere. So far I have tried to
> build Spark 2.0 on my local laptop with no success, so I blamed my
> laptop's poor performance.
> So today I fired off an EC2 Ubuntu 16.06 instance and installed the
> following (I copy-paste the commands here):
>
> ubuntu@ip-172-31-40-104:~/spark$ history
> sudo apt-get install -y python-software-properties
> sudo apt-get install -y software-properties-common
> sudo add-apt-repository -y ppa:webupd8team/java
> sudo apt-get install -y oracle-java8-installer
> sudo apt-get install -y git
> git clone git://github.com/apache/spark.git
> cd spark
>
> Then I launched the following commands.
> First this one, with only YARN, as I don't need Hadoop:
>
>  ./build/mvn -Pyarn  -DskipTests clean package
>
> This failed. After kicking off the same command with -X I got this:
>
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on
> project spark-core_2.11: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed ->
> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-core_2.11: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
> ... 20 more
> Caused by: Compile failed via zinc server
> at
> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
> at
> sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
> at
> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
> at
> scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
> at
> scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
> ... 21 more
>
>
>
> Then i tried to use hadoop even though i dont need hadoop for my code
>
> ./build/mvn -Pyarn -Phadoop-2.4  -DskipTests clean package
>
> This failed again, at exactly the same point, with the same error
>
> Then i thought maybe i used an old version of hadoop, so i tried to use 2.7
>
>   ./build/mvn -Pyarn -Phadoop-2.7  -DskipTests clean 

Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-10-01 Thread Marco Mistroni
Hi guys
 sorry to annoy you with this, but I am getting nowhere. So far I have tried
to build Spark 2.0 on my local laptop with no success, so I blamed my
laptop's poor performance.
So today I fired off an EC2 Ubuntu 16.06 instance and installed the
following (I copy-paste the commands here):

ubuntu@ip-172-31-40-104:~/spark$ history
sudo apt-get install -y python-software-properties
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get install -y oracle-java8-installer
sudo apt-get install -y git
git clone git://github.com/apache/spark.git
cd spark

Then I launched the following commands.
First this one, with only YARN, as I don't need Hadoop:

 ./build/mvn -Pyarn  -DskipTests clean package

This failed. After kicking off the same command with -X I got this:

[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
on project spark-core_2.11: Execution scala-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
-> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
(scala-compile-first) on project spark-core_2.11: Execution
scala-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
scala-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
... 20 more
Caused by: Compile failed via zinc server
at
sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
at
sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
at
scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
at
scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
at
scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
at
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
... 21 more



Then I tried to use Hadoop even though I don't need Hadoop for my code:

./build/mvn -Pyarn -Phadoop-2.4  -DskipTests clean package

This failed again, at exactly the same point, with the same error.

Then I thought maybe I had used an old version of Hadoop, so I tried to use 2.7:

  ./build/mvn -Pyarn -Phadoop-2.7  -DskipTests clean package

Nada. Same error.


I have tried sbt; no luck.

Could anyone suggest an mvn profile to use to build Spark 2.0? I am
starting to suspect (before arguing that there's something wrong with the Spark
git repo) that I am somehow using the wrong parameters. Or perhaps I should
install Scala 2.11 before I install Spark? Or Maven?

kr
 marco


On Fri, Sep 30, 2016 at 8:23 PM, Marco Mistroni 

Performance problem with BlockMatrix.add()

2016-10-01 Thread Andi
Hello,

I'm implementing a PageRank-like iterative algorithm where in each iteration
a number of matrix operations is performed. One step is to add two matrices
that are both the result of several matrix multiplications. Unfortunately,
when using the add() operation of BlockMatrix, Spark gets completely stuck for
>30 sec. I implemented an alternative (naive) addition myself, but it still
takes >10 sec. The matrices consist of about 1500x1500 entries. I tested in
local mode and standalone mode on a laptop with 4 cores and 20GB RAM. The code
that causes the problems can be found here: http://imgur.com/a/ZaomX
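
For context, the addition is roughly of this shape (a simplified sketch with
random data, not the actual code behind the link):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Two ~1500x1500 BlockMatrices; add() requires the same rowsPerBlock and
// colsPerBlock on both operands, and caching avoids recomputing the
// upstream multiplications inside add().
def randomBlockMatrix(sc: SparkContext, n: Int = 1500): BlockMatrix = {
  val entries = sc.parallelize(0 until n).flatMap { i =>
    (0 until n).map(j => MatrixEntry(i.toLong, j.toLong, scala.util.Random.nextDouble()))
  }
  new CoordinateMatrix(entries, n, n).toBlockMatrix(500, 500).cache()
}
// val sum = randomBlockMatrix(sc).add(randomBlockMatrix(sc))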
  
Any help is highly appreciated!

Thanks
Andi







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-problem-with-BlockMatrix-add-tp27825.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



execution sequence puzzle

2016-10-01 Thread chen yong
Hello everybody,

I am puzzled by the execution sequence of the following Scala program. Please
tell me if it runs in the same sequence on your computer, and whether that is normal.

Thanks

Execution sequence according to line number:
26-7-20-27-9-10-12-13-15-7-20-17-18-9-10-12-13-15-7-20-17-18


1.  import scala.collection.mutable.ArrayBuffer
2.
3.  object TestCase1{
4.
5.    def func(test:Iterator[(Int,Long)]): Iterator[(Int,Long)]={
6.
7.      println("in")
8.      val test1=test.flatMap{
9.        case(item,count)=>
10.         val newPrefix=item
11.
12.         println(count)
13.         val a=Iterator.single((newPrefix,count))
14.
15.         func(a)
16.
17.         val c = a
18.         c
19.      }
20.      test1
21.    }
22.
23.    def main(args: Array[String]){
24.      val freqItems = ArrayBuffer((2,3L),(3,2L),(4,1L))
25.      val test = freqItems.toIterator
26.      val result = func(test)
27.      val reer = result.toArray
28.
29.    }
30. }




Re: get different results when debugging and running scala program

2016-10-01 Thread chen yong
Dear Jakob,


Thanks for your reply.


The output text in the console is as follows:


Running output:

1
2
3
4
5
6
7

in
3
in
2
in
1
in



Debugging output:
01
02
03
04
05
06
07
08
09
10
11

in
3
in
2
in
2
in
1
in
1
in




From: Jakob Odersky
Sent: October 1, 2016, 8:44
To: chen yong
Cc: user@spark.apache.org
Subject: Re: get different results when debugging and running scala program

There is no image attached, I'm not sure how the apache mailing lists
handle them. Can you provide the output as text?

best,
--Jakob

On Fri, Sep 30, 2016 at 8:25 AM, chen yong  wrote:
> Hello All,
>
>
>
> I am using IDEA 15.0.4 to debug a Scala program. It is strange to me that
> the results are different when I debug the program versus when I run it. The
> differences can be seen in the attached files run.jpg and debug.jpg. The code
> lines of the Scala program are shown below.
>
>
> Thank you all
>
>
> ---
>
> import scala.collection.mutable.ArrayBuffer
>
> object TestCase1{
> def func(test:Iterator[(Int,Long)]): Iterator[(Int,Long)]={
> println("in")
> val test1=test.flatMap{
> case(item,count)=>
> val newPrefix=item
> println(count)
> val a=Iterator.single((newPrefix,count))
> func(a)
> val c = a
> c
> }
> test1
> }
> def main(args: Array[String]){
> val freqItems = ArrayBuffer((2,3L),(3,2L),(4,1L))
> val test = freqItems.toIterator
> val result = func(test)
> val reer = result.toArray
> }
> }
>
>
>


Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-10-01 Thread Takeshi Yamamuro
I got this info from a Hadoop JIRA ticket:
https://issues.apache.org/jira/browse/MAPREDUCE-5485

// maropu

On Sat, Oct 1, 2016 at 7:14 PM, Igor Berman  wrote:

> Takeshi, why are you saying this? How have you checked that it's only used from
> 2.7.3?
> We use spark 2.0 which is shipped with hadoop dependency of 2.7.2 and we
> use this setting.
> We've sort of "verified" it's used by configuring log of file output
> commiter
>
> On 30 September 2016 at 03:12, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> FYI: Seems
>> `sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")`
>> is only available in hadoop-2.7.3+.
>>
>> // maropu
>>
>>
>> On Thu, Sep 29, 2016 at 9:28 PM, joffe.tal  wrote:
>>
>>> You can use partitions explicitly by adding "/<partition column>=<partition value>" to
>>> the end of the path you are writing to and then use overwrite.
>>>
>>> BTW in Spark 2.0 you just need to use:
>>>
>>> sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
>>> and use s3a://
>>>
>>> and you can work with regular output committer (actually
>>> DirectParquetOutputCommitter is no longer available in Spark 2.0)
>>>
>>> so if you are planning on upgrading this could be another motivation
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/S3-DirectParquetOutputCommitter-Partit
>>> ionBy-SaveMode-Append-tp26398p27810.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-10-01 Thread Igor Berman
Takeshi, why are you saying this? How have you checked that it's only used from
2.7.3?
We use spark 2.0 which is shipped with hadoop dependency of 2.7.2 and we
use this setting.
We've sort of "verified" it's used by configuring log of file output
commiter
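
Roughly what our write path looks like, as a sketch (the bucket, path, and
partition column are placeholders):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("s3a-append-sketch").getOrCreate()

// The "version 2" committer avoids the slow rename-based commit on S3.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// df is the DataFrame being appended; partition column and bucket are made up.
df.write
  .partitionBy("date")
  .mode(SaveMode.Append)
  .parquet("s3a://my-bucket/tables/events")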

On 30 September 2016 at 03:12, Takeshi Yamamuro 
wrote:

> Hi,
>
> FYI: Seems
> `sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")`
> is only available in hadoop-2.7.3+.
>
> // maropu
>
>
> On Thu, Sep 29, 2016 at 9:28 PM, joffe.tal  wrote:
>
> You can use partitions explicitly by adding "/<partition column>=<partition value>" to
> the end of the path you are writing to and then use overwrite.
>>
>> BTW in Spark 2.0 you just need to use:
>>
>> sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
>> and use s3a://
>>
>> and you can work with regular output committer (actually
>> DirectParquetOutputCommitter is no longer available in Spark 2.0)
>>
>> so if you are planning on upgrading this could be another motivation
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/S3-DirectParquetOutputCommitter-Partit
>> ionBy-SaveMode-Append-tp26398p27810.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>