Re: [Error:] viewing Web UI on EMR cluster

2016-09-12 Thread Mohammad Tariq
Hi Divya,

Do you have inbound access enabled on port 50070 of your NN machine? Also,
it's a good idea to have the public DNS in your /etc/hosts for proper name
resolution.
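
A quick way to sanity-check reachability from your machine (an illustrative
sketch only; the hostname is a placeholder for your master node's public DNS):

import socket

host = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"  # placeholder public DNS
for port in (50070, 8088, 18080):  # NameNode UI, YARN RM UI, Spark history server defaults
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(3)
    print(port, "reachable" if s.connect_ex((host, port)) == 0 else "blocked/closed")
    s.close()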


Tariq, Mohammad
about.me/mti

On Tue, Sep 13, 2016 at 9:28 AM, Divya Gehlot 
wrote:

> Hi,
> I am on EMR 4.7 with Spark 1.6.1 and Hadoop 2.7.2.
> When I am trying to view any of the web UIs of the cluster, either Hadoop or
> Spark, I am getting the below error:
> "
> This site can’t be reached
>
> "
> Is anybody using EMR able to view the web UI?
> Could you please share the steps.
>
> Would really appreciate the help.
>
> Thanks,
> Divya
>


[Error:] viewing Web UI on EMR cluster

2016-09-12 Thread Divya Gehlot
Hi,
I am on EMR 4.7 with Spark 1.6.1 and Hadoop 2.7.2.
When I am trying to view any of the web UIs of the cluster, either Hadoop or
Spark, I am getting the below error:
"
This site can’t be reached
"
Is anybody using EMR able to view the web UI?
Could you please share the steps.

Would really appreciate the help.

Thanks,
Divya


unsubscribe

2016-09-12 Thread 常明敏
unsubscribe



Re: Debugging a spark application in a none lazy mode

2016-09-12 Thread Takeshi Yamamuro
It seems to me the only thing you can do is inject `collect` calls map-by-map, like:

`df.map(x => do something...).collect`  // check intermediate results in
maps

This only works for small datasets though.
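
For what it's worth, the same pattern in PySpark could look roughly like this
(a sketch only; `df` and the column names are placeholders):

step1 = df.select("colA", "colB")
print(step1.take(5))            # force evaluation, inspect a few rows

step2 = step1.filter(step1.colA > 0)
print(step2.take(5))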

// maropu

On Tue, Sep 13, 2016 at 1:38 AM, Attias, Hagai  wrote:

> Hi,
>
> Not sure what you mean, can you give an example?
>
>
>
> Hagai.
>
>
>
> *From: *Takeshi Yamamuro 
> *Date: *Monday, September 12, 2016 at 7:24 PM
> *To: *Hagai Attias 
> *Cc: *"user@spark.apache.org" 
> *Subject: *Re: Debugging a spark application in a none lazy mode
>
>
>
> Hi,
>
>
>
> Spark does not have such mode.
>
> How about getting local arrays by `collect` methods for debugging?
>
>
>
> // maropu
>
>
>
> On Tue, Sep 13, 2016 at 12:44 AM, Hagai  wrote:
>
> Hi guys,
> Lately I was looking for a way to debug my spark application locally.
>
> However, since all transformations are actually being executed when the
> action is encountered, I have no way to look at the data after each
> transformation. Does spark support a non-lazy mode which enables to execute
> the transformations locally after each statement?
>
> Thanks,
> Hagai.
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Debugging-a-spark-application-
> in-a-none-lazy-mode-tp27695.html
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>
> --
>
> ---
> Takeshi Yamamuro
>



-- 
---
Takeshi Yamamuro


Check if a nested column exists in DataFrame

2016-09-12 Thread Arun Patel
I'm trying to analyze XML documents using spark-xml package.  Since all XML
columns are optional, some columns may or may not exist. When I register
the DataFrame as a table, how do I check whether a nested column exists? My
column name is "emp", which is already exploded, and I am trying to check
whether the nested column "emp.mgr.col" exists. If it exists, I need to use
it. If it does not exist, I should set it to null. Is there a way to
achieve this?

Please note I am not able to use .columns method because it does not show
the nested columns.

Also, note that I  cannot manually specify the schema because of my
requirement.

I'm trying this in Pyspark.

Thank you.
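
One possible approach (an illustrative sketch only, not a confirmed solution):
walk df.schema to test for the dotted path, then select either the real column
or a null literal. `df` here stands for the exploded DataFrame from the question.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType

def has_nested_field(schema, path):
    """Return True if a dotted path such as 'emp.mgr.col' exists in the schema."""
    current = schema
    for name in path.split("."):
        if not isinstance(current, StructType) or name not in current.names:
            return False
        current = current[name].dataType
    return True

col_path = "emp.mgr.col"
if has_nested_field(df.schema, col_path):
    result = df.withColumn("mgr_col", F.col(col_path))
else:
    result = df.withColumn("mgr_col", F.lit(None))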


Re: Strings not converted when calling Scala code from a PySpark app

2016-09-12 Thread Holden Karau
Ah yes, the Py4J conversions only apply in the driver program - your
DStream, however, consists of RDDs of pickled objects. If you want to work
with a transform function, using Spark SQL and transferring DataFrames back
and forth between Python and Scala can be much easier.
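
A minimal sketch of that idea (assumptions: `values` is the DStream from the
question, and `doSomethingWithDF` is a hypothetical Scala method that accepts
a DataFrame):

from pyspark.sql import Row, SQLContext

def send_to_scala(rdd):
    sc = rdd.context
    sql_ctx = SQLContext.getOrCreate(sc)
    df = sql_ctx.createDataFrame(rdd.map(lambda s: Row(value=s)))
    # df._jdf is the underlying JVM DataFrame, which Scala code can consume directly
    sc._jvm.com.seigneurin.MyPythonHelper.doSomethingWithDF(df._jdf)

values.foreachRDD(send_to_scala)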

On Monday, September 12, 2016, Alexis Seigneurin 
wrote:

> Hi,
>
>
> *TL;DR - I have what looks like a DStream of Strings in a PySpark
> application. I want to send it as a DStream[String] to a Scala library.
> Strings are not converted by Py4j, though.*
>
>
> I'm working on a PySpark application that pulls data from Kafka using
> Spark Streaming. My messages are strings and I would like to call a method
> in Scala code, passing it a DStream[String] instance. However, I'm unable
> to receive proper JVM strings in the Scala code. It looks to me like the
> Python strings are not converted into Java strings but, instead, are
> serialized.
>
> My question would be: how to get Java strings out of the DStream object?
>
>
> Here is the simplest Python code I came up with:
>
> from pyspark.streaming import StreamingContext
> ssc = StreamingContext(sparkContext=sc, batchDuration=int(1))
>
> from pyspark.streaming.kafka import KafkaUtils
> stream = KafkaUtils.createDirectStream(ssc, ["IN"],
> {"metadata.broker.list": "localhost:9092"})
> values = stream.map(lambda tuple: tuple[1])
>
> ssc._jvm.com.seigneurin.MyPythonHelper.doSomething(values._jdstream)
>
> ssc.start()
>
>
> I'm running this code in PySpark, passing it the path to my JAR:
>
> pyspark --driver-class-path ~/path/to/my/lib-0.1.1-SNAPSHOT.jar
>
>
> On the Scala side, I have:
>
> package com.seigneurin
>
> import org.apache.spark.streaming.api.java.JavaDStream
>
> object MyPythonHelper {
>   def doSomething(jdstream: JavaDStream[String]) = {
> val dstream = jdstream.dstream
> dstream.foreachRDD(rdd => {
>   rdd.foreach(println)
> })
>   }
> }
>
>
> Now, let's say I send some data into Kafka:
>
> echo 'foo bar' | $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list
> localhost:9092 --topic IN
>
>
> The println statement in the Scala code prints something that looks like:
>
> [B@758aa4d9
>
>
> I expected to get foo bar instead.
>
> Now, if I replace the simple println statement in the Scala code with the
> following:
>
> rdd.foreach(v => println(v.getClass.getCanonicalName))
>
>
> I get:
>
> java.lang.ClassCastException: [B cannot be cast to java.lang.String
>
>
> This suggests that the strings are actually passed as arrays of bytes.
>
> If I simply try to convert this array of bytes into a string (I know I'm
> not even specifying the encoding):
>
>   def doSomething(jdstream: JavaDStream[Array[Byte]]) = {
> val dstream = jdstream.dstream
> dstream.foreachRDD(rdd => {
>   rdd.foreach(bytes => println(new String(bytes)))
> })
>   }
>
>
> I get something that looks like (special characters might be stripped off):
>
> �]qXfoo barqa.
>
>
> This suggests the Python string was serialized (pickled?). How could I
> retrieve a proper Java string instead?
>
>
> Thanks,
> Alexis
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Strings not converted when calling Scala code from a PySpark app

2016-09-12 Thread Alexis Seigneurin
Hi,


*TL;DR - I have what looks like a DStream of Strings in a PySpark
application. I want to send it as a DStream[String] to a Scala library.
Strings are not converted by Py4j, though.*


I'm working on a PySpark application that pulls data from Kafka using Spark
Streaming. My messages are strings and I would like to call a method in
Scala code, passing it a DStream[String] instance. However, I'm unable to
receive proper JVM strings in the Scala code. It looks to me like the
Python strings are not converted into Java strings but, instead, are
serialized.

My question would be: how to get Java strings out of the DStream object?


Here is the simplest Python code I came up with:

from pyspark.streaming import StreamingContext
ssc = StreamingContext(sparkContext=sc, batchDuration=int(1))

from pyspark.streaming.kafka import KafkaUtils
stream = KafkaUtils.createDirectStream(ssc, ["IN"],
{"metadata.broker.list": "localhost:9092"})
values = stream.map(lambda tuple: tuple[1])

ssc._jvm.com.seigneurin.MyPythonHelper.doSomething(values._jdstream)

ssc.start()


I'm running this code in PySpark, passing it the path to my JAR:

pyspark --driver-class-path ~/path/to/my/lib-0.1.1-SNAPSHOT.jar


On the Scala side, I have:

package com.seigneurin

import org.apache.spark.streaming.api.java.JavaDStream

object MyPythonHelper {
  def doSomething(jdstream: JavaDStream[String]) = {
val dstream = jdstream.dstream
dstream.foreachRDD(rdd => {
  rdd.foreach(println)
})
  }
}


Now, let's say I send some data into Kafka:

echo 'foo bar' | $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list
localhost:9092 --topic IN


The println statement in the Scala code prints something that looks like:

[B@758aa4d9


I expected to get foo bar instead.

Now, if I replace the simple println statement in the Scala code with the
following:

rdd.foreach(v => println(v.getClass.getCanonicalName))


I get:

java.lang.ClassCastException: [B cannot be cast to java.lang.String


This suggests that the strings are actually passed as arrays of bytes.

If I simply try to convert this array of bytes into a string (I know I'm
not even specifying the encoding):

  def doSomething(jdstream: JavaDStream[Array[Byte]]) = {
val dstream = jdstream.dstream
dstream.foreachRDD(rdd => {
  rdd.foreach(bytes => println(new String(bytes)))
})
  }


I get something that looks like (special characters might be stripped off):

�]qXfoo barqa.


This suggests the Python string was serialized (pickled?). How could I
retrieve a proper Java string instead?


Thanks,
Alexis


Re: Spark with S3 DirectOutputCommitter

2016-09-12 Thread Srikanth
Thanks Steve!

We are already using HDFS as an intermediate store. This is for the last
stage of processing which has to put data in S3.
The output is partitioned by 3 fields, like
.../field1=111/field2=999/date=-MM-DD/*
Given that there are 100s of folders and 1000s of subfolders and part
files, renaming from _temporary is just not practical in S3.
I guess we have to add another stage with S3Distcp??

Srikanth
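
For reference, a minimal sketch (illustrative only) of mapping the committer
settings Steve lists below onto Spark's spark.hadoop.* configuration prefix:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("MergeEntities")
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
         .getOrCreate())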

On Sun, Sep 11, 2016 at 2:34 PM, Steve Loughran 
wrote:

>
> > On 9 Sep 2016, at 21:54, Srikanth  wrote:
> >
> > Hello,
> >
> > I'm trying to use DirectOutputCommitter for s3a in Spark 2.0. I've tried
> a few configs and none of them seem to work.
> > Output always creates _temporary directory. Rename is killing
> performance.
>
> > I read some notes about DirectOutputcommitter causing problems with
> speculation turned on. Was this option removed entirely?
>
> Spark turns off any committer with the word "direct' in its name if
> speculation==true . Concurrency, see.
>
> even on on-speculative execution, the trouble with the direct options is
> that executor/job failures can leave incomplete/inconsistent work around
> —and the things downstream wouldn't even notice
>
> There's work underway to address things, work which requires a consistent
> metadata store alongside S3 ( HADOOP-13345 : S3Guard).
>
> For now: stay with the file output committer
>
> hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
> hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true
>
> Even better: use HDFS as the intermediate store for work, only do a bulk
> upload at the end.
>
> >
> >   val spark = SparkSession.builder()
> > .appName("MergeEntities")
> > .config("spark.sql.warehouse.dir",
> mergeConfig.getString("sparkSqlWarehouseDir"))
> > .config("fs.s3a.buffer.dir", "/tmp")
> > .config("spark.hadoop.mapred.output.committer.class",
> classOf[DirectOutputCommitter].getCanonicalName)
> > .config("mapred.output.committer.class",
> classOf[DirectOutputCommitter].getCanonicalName)
> > .config("mapreduce.use.directfileoutputcommitter",
> "true")
> > //.config("spark.sql.sources.outputCommitterClass",
> classOf[DirectOutputCommitter].getCanonicalName)
> > .getOrCreate()
> >
> > Srikanth
>
>


LDA spark ML visualization

2016-09-12 Thread janardhan shetty
Hi,

I am trying to visualize the LDA model developed in spark scala (2.0 ML) in
LDAvis.

Is there any links to convert the spark model parameters to the following 5
params to visualize ?

1. φ, the K × W matrix containing the estimated probability mass function
over the W terms in the vocabulary for each of the K topics in the model.
Note that φkw > 0 for all k ∈ 1...K and all w ∈ 1...W, because of the
priors. (Although our software allows values of zero due to rounding). Each
of the K rows of φ must sum to one.
2. θ, the D × K matrix containing the estimated probability mass function
over the K topics in the model for each of the D documents in the corpus.
Note that θdk > 0 for all d ∈ 1...D and all k ∈ 1...K, because of the
priors (although, as above, our software accepts zeroes due to rounding).
Each of the D rows of θ must sum to one.
3. nd, the number of tokens observed in document d, where nd is required to
be an integer greater than zero, for documents d = 1...D. Denoted
doc.length in our code.
4. vocab, the length-W character vector containing the terms in the
vocabulary (listed in the same order as the columns of φ).
5. Mw, the frequency of term w across the entire corpus, where Mw is
required to be an integer greater than zero for each term w = 1...W.
Denoted term.frequency in our code.
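
A rough sketch of one possible mapping with pyspark.ml (my own illustration,
not from the thread; the Scala API is analogous). Assumptions: `counts_df` has
a "features" column produced by a fitted CountVectorizerModel `cv_model`, and
the corpus is small enough to collect the per-document pieces locally.

import numpy as np
from pyspark.ml.clustering import LDA

lda_model = LDA(k=10, maxIter=50, featuresCol="features").fit(counts_df)

# phi: K x W topic-term distributions (topicsMatrix is vocabSize x k, so transpose)
phi = lda_model.topicsMatrix().toArray().T
phi = phi / phi.sum(axis=1, keepdims=True)

# theta: D x K document-topic distributions
theta = np.array(lda_model.transform(counts_df)
                 .select("topicDistribution")
                 .rdd.map(lambda r: r[0].toArray()).collect())

# vocab, doc lengths (nd) and corpus-wide term frequencies (Mw)
vocab = cv_model.vocabulary
doc_lengths = counts_df.rdd.map(lambda r: int(r["features"].toArray().sum())).collect()
term_frequency = counts_df.rdd.map(lambda r: r["features"].toArray()).reduce(lambda a, b: a + b)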


Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-12 Thread Mario Ds Briggs


Daniel,

I believe it is related to
https://issues.apache.org/jira/browse/SPARK-13979 and happens only when
task fails in a executor (probably for some other reason u hit the latter
in parquet and not csv).

The PR in there, should be shortly available in IBM's Analytics for Spark.


thanks
Mario



From:   Adam Roberts/UK/IBM
To: Mario Ds Briggs/India/IBM@IBMIN
Date:   12/09/2016 09:37 pm
Subject:Fw: Spark + Parquet + IBM Block Storage at Bluemix


Mario, in case you've not seen this...

Adam Roberts
IBM Spark Team Lead
Runtime Technologies - Hursley



- Forwarded by Adam Roberts/UK/IBM on 12/09/2016 17:06 -

From:   Daniel Lopes 
To: Steve Loughran 
Cc: user 
Date:   12/09/2016 13:05
Subject:Re: Spark + Parquet + IBM Block Storage at Bluemix



Thanks Steve,

But this error occurs only with parquet files, CSVs works.

Best,

Daniel Lopes
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br

On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran 
wrote:

On 9 Sep 2016, at 17:56, Daniel Lopes 
wrote:

Hi, someone can help

I'm trying to use parquet in IBM Block Storage at Spark but when I
try to load get this error:

using this config

credentials = {
  "name": "keystone",
  "auth_url": "https://identity.open.softlayer.com;,
  "project": "object_storage_23f274c1_d11XXXe634",
  "projectId": "XXd9c4aa39b7c7eb",
  "region": "dallas",
  "userId": "X64087180b40X2b909",
  "username": "admin_9dd810f8901d48778XX",
  "password": "chX6_",
  "domainId": "c1ddad17cfcX41",
  "domainName": "10XX",
  "role": "admin"
}

def set_hadoop_config(credentials):
    """This function sets the Hadoop configuration with given
    credentials, so it is possible to access data using SparkContext"""

    prefix = "fs.swift.service." + credentials['name']
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials['auth_url'] + '/v3/auth/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials['projectId'])
    hconf.set(prefix + ".username", credentials['userId'])
    hconf.set(prefix + ".password", credentials['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials['region'])
    hconf.setBoolean(prefix + ".public", True)

set_hadoop_config(credentials)

-

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 train.groupby('Acordo').count().show()

Py4JJavaError: An error occurred while calling o406.showString.
: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 60 in stage 30.0 failed 10 times, most recent
failure: Lost task 60.9 in stage 30.0 (TID 2556,
yp-spark-dal09-env5-0039):
org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException:
Missing mandatory configuration option:
   

How to know which slaves are used by an application

2016-09-12 Thread Xiaoye Sun
Hi all,

I am currently making some changes in Spark in my research project.

In my development, after an application has been submitted to the spark
master, I want to get the IP addresses of all the slaves used by that
application, so that the spark master is able to talk to the slave machines
through a proposed mechanism. I am wondering which class/object in spark
master has such information and will it be a different case when the
cluster is managed by a standalone scheduler, Yarn and Mesos.

I saw something related to this question in the master's log in standalone
mode as follows. However, in function executorAdded in class
SparkDeploySchedulerBackend, it just prints a log without adding the slave
to anything.
I am using spark 1.6.1.

16/09/12 11:34:41.262 INFO AppClient$ClientEndpoint: Connecting to master
spark://192.168.50.105:7077...
16/09/12 11:34:41.283 DEBUG TransportClientFactory: Creating new connection
to /192.168.50.105:7077
16/09/12 11:34:41.302 DEBUG ResourceLeakDetector:
-Dio.netty.leakDetectionLevel: simple
16/09/12 11:34:41.307 DEBUG TransportClientFactory: Connection to /
192.168.50.105:7077 successful, running bootstraps...
16/09/12 11:34:41.307 DEBUG TransportClientFactory: Successfully created
connection to /192.168.50.105:7077 after 23 ms (0 ms spent in bootstraps)
16/09/12 11:34:41.334 DEBUG Recycler:
-Dio.netty.recycler.maxCapacity.default: 262144
16/09/12 11:34:41.458 INFO SparkDeploySchedulerBackend: Connected to Spark
cluster with app ID app-20160912113441-
16/09/12 11:34:41.459 DEBUG BlockManager: BlockManager initialize is called
16/09/12 11:34:41.463 DEBUG TransportServer: Shuffle server started on port
:35874
16/09/12 11:34:41.463 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 35874.
16/09/12 11:34:41.464 INFO NettyBlockTransferService: Server created on
35874
16/09/12 11:34:41.465 INFO BlockManagerMaster: Trying to register
BlockManager
16/09/12 11:34:41.468 INFO BlockManagerMasterEndpoint: Registering block
manager 192.168.50.105:35874 with 3.8 GB RAM, BlockManagerId(driver,
192.168.50.105, 35874)
16/09/12 11:34:41.470 INFO BlockManagerMaster: Registered BlockManager
16/09/12 11:34:41.486 INFO AppClient$ClientEndpoint: Executor added:
app-20160912113441-/0 on worker-20160912113428-192.168.50.106-59927
(192.168.50.106:59927) with 1 cores
16/09/12 11:34:41.486 INFO SparkDeploySchedulerBackend: Granted executor
ID app-20160912113441-/0 on hostPort 192.168.50.106:59927
 with 1 cores, 6.0 GB RAM
16/09/12 11:34:41.487 INFO AppClient$ClientEndpoint: Executor added:
app-20160912113441-/1 on worker-20160912113428-192.168.50.106-59927
(192.168.50.106:59927) with 1 cores
16/09/12 11:34:41.487 INFO SparkDeploySchedulerBackend: Granted executor
ID app-20160912113441-/1 on hostPort 192.168.50.106:59927
 with 1 cores, 6.0 GB RAM
16/09/12 11:34:41.488 INFO AppClient$ClientEndpoint: Executor added:
app-20160912113441-/2 on worker-20160912113405-192.168.50.108-35454
(192.168.50.108:35454) with 1 cores
16/09/12 11:34:41.489 INFO SparkDeploySchedulerBackend: Granted executor
ID app-20160912113441-/2 on hostPort 192.168.50.108:35454
 with 1 cores, 6.0 GB RAM

Thanks!

Best,
Xiaoye


Re: Spark transformations

2016-09-12 Thread Thunder Stumpges
Yep, totally with you on this. None of it is ideal but doesn't sound like
there will be any changes coming to the visibility of ml supporting
classes.
-Thunder
On Mon, Sep 12, 2016 at 10:10 AM janardhan shetty 
wrote:

> Thanks Thunder. To copy the code base is difficult since we need to copy
> in entirety or transitive dependency files as well.
> If we need to do complex operations of taking a column as a whole instead
> of each element in a row is not possible as of now.
>
> Trying to find few pointers to easily solve this.
>
> On Mon, Sep 12, 2016 at 9:43 AM, Thunder Stumpges <
> thunder.stump...@gmail.com> wrote:
>
>> Hi Janardhan,
>>
>> I have run into similar issues and asked similar questions. I also ran
>> into many problems with private code when trying to write my own
>> Model/Transformer/Estimator. (you might be able to find my question to the
>> group regarding this, I can't really tell if my emails are getting through,
>> as I don't get any responses). For now I have resorted to copying out the
>> code that I need from the spark codebase and into mine. I'm certain this is
>> not the best, but it has to be better than "implementing it myself" which
>> was what the only response to my question said to do.
>>
>> As for the transforms, I also asked a similar question. The only way I've
>> seen it done in code is using a UDF. As you mention, the UDF can only
>> access fields on a "row by row" basis. I have not gotten any replies at all
>> on my question, but I also need to do some more complicated operation in my
>> work (join to another model RDD, flat-map, calculate, reduce) in order to
>> get the value for the new column. So far no great solution.
>>
>> Sorry I don't have any answers, but wanted to chime in that I am also a
>> bit stuck on similar issues. Hope we can find a workable solution soon.
>> Cheers,
>> Thunder
>>
>>
>>
>> On Tue, Sep 6, 2016 at 1:32 PM janardhan shetty 
>> wrote:
>>
>>> Noticed few things about Spark transformers just wanted to be clear.
>>>
>>> Unary transformer:
>>>
>>> createTransformFunc: IN => OUT  = { *item* => }
>>> Here *item *is single element and *NOT* entire column.
>>>
>>> I would like to get the number of elements in that particular column.
>>> Since there is *no forward checking* how can we get this information ?
>>> We have visibility into single element and not the entire column.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Sep 4, 2016 at 9:30 AM, janardhan shetty >> > wrote:
>>>
 In scala Spark ML Dataframes.

 On Sun, Sep 4, 2016 at 9:16 AM, Somasundaram Sekar <
 somasundar.se...@tigeranalytics.com> wrote:

> Can you try this
>
>
> https://www.linkedin.com/pulse/hive-functions-udfudaf-udtf-examples-gaurav-singh
>
> On 4 Sep 2016 9:38 pm, "janardhan shetty" 
> wrote:
>
>> Hi,
>>
>> Is there any chance that we can send entire multiple columns to an
>> udf and generate a new column for Spark ML.
>> I see similar approach as VectorAssembler but not able to use few
>> classes /traitslike HasInputCols, HasOutputCol, DefaultParamsWritable 
>> since
>> they are private.
>>
>> Any leads/examples is appreciated in this regard..
>>
>> Requirement:
>> *Input*: Multiple columns of a Dataframe
>> *Output*:  Single new modified column
>>
>

>>>
>


Re: Partition n keys into exacly n partitions

2016-09-12 Thread Denis Bolshakov
Just provide your own partitioner.

Once I wrote a partitioner which keeps similar keys together in one
partition.
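
A minimal PySpark sketch of the idea (in Scala one would extend
org.apache.spark.Partitioner instead; `pair_rdd` is a placeholder for an RDD
of (key, value) tuples):

keys = pair_rdd.keys().distinct().collect()            # e.g. ['x', 'y', ...]
key_to_partition = {k: i for i, k in enumerate(keys)}  # one partition per key

partitioned = pair_rdd.partitionBy(len(keys), lambda k: key_to_partition[k])
print(partitioned.getNumPartitions())                  # exactly len(keys) partitions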

Best regards,
Denis

On 12 September 2016 at 19:44, sujeet jog  wrote:

> Hi,
>
> Is there a way to partition set of data with n keys into exactly n
> partitions.
>
> For ex : -
>
> tuple of 1008 rows with key as x
> tuple of 1008 rows with key as y   and so on  total 10 keys ( x, y etc )
>
> Total records = 10080
> NumOfKeys = 10
>
> i want to partition the 10080 elements into exactly 10 partitions with
> each partition having elements with unique key
>
> Is there a way to make this happen ?.. any ideas on implementing custom
> partitioner.
>
>
> The current partitioner i'm using is HashPartitioner from which there are
> cases where key.hascode() % numPartitions  for keys of x & y become same.
>
>  hence many elements with different keys fall into single partition at
> times.
>
>
>
> Thanks,
> Sujeet
>



-- 
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.de...@gmail.com


Re: Spark transformations

2016-09-12 Thread janardhan shetty
Thanks Thunder. Copying the code base is difficult, since we would need to
copy it in its entirety along with its transitive dependency files.
Complex operations that take a column as a whole, instead of each element
in a row, are not possible as of now.

Trying to find few pointers to easily solve this.

On Mon, Sep 12, 2016 at 9:43 AM, Thunder Stumpges <
thunder.stump...@gmail.com> wrote:

> Hi Janardhan,
>
> I have run into similar issues and asked similar questions. I also ran
> into many problems with private code when trying to write my own
> Model/Transformer/Estimator. (you might be able to find my question to the
> group regarding this, I can't really tell if my emails are getting through,
> as I don't get any responses). For now I have resorted to copying out the
> code that I need from the spark codebase and into mine. I'm certain this is
> not the best, but it has to be better than "implementing it myself" which
> was what the only response to my question said to do.
>
> As for the transforms, I also asked a similar question. The only way I've
> seen it done in code is using a UDF. As you mention, the UDF can only
> access fields on a "row by row" basis. I have not gotten any replies at all
> on my question, but I also need to do some more complicated operation in my
> work (join to another model RDD, flat-map, calculate, reduce) in order to
> get the value for the new column. So far no great solution.
>
> Sorry I don't have any answers, but wanted to chime in that I am also a
> bit stuck on similar issues. Hope we can find a workable solution soon.
> Cheers,
> Thunder
>
>
>
> On Tue, Sep 6, 2016 at 1:32 PM janardhan shetty 
> wrote:
>
>> Noticed few things about Spark transformers just wanted to be clear.
>>
>> Unary transformer:
>>
>> createTransformFunc: IN => OUT  = { *item* => }
>> Here *item *is single element and *NOT* entire column.
>>
>> I would like to get the number of elements in that particular column.
>> Since there is *no forward checking* how can we get this information ?
>> We have visibility into single element and not the entire column.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sun, Sep 4, 2016 at 9:30 AM, janardhan shetty 
>> wrote:
>>
>>> In scala Spark ML Dataframes.
>>>
>>> On Sun, Sep 4, 2016 at 9:16 AM, Somasundaram Sekar >> tigeranalytics.com> wrote:
>>>
 Can you try this

 https://www.linkedin.com/pulse/hive-functions-udfudaf-
 udtf-examples-gaurav-singh

 On 4 Sep 2016 9:38 pm, "janardhan shetty" 
 wrote:

> Hi,
>
> Is there any chance that we can send entire multiple columns to an udf
> and generate a new column for Spark ML.
> I see similar approach as VectorAssembler but not able to use few
> classes /traitslike HasInputCols, HasOutputCol, DefaultParamsWritable 
> since
> they are private.
>
> Any leads/examples is appreciated in this regard..
>
> Requirement:
> *Input*: Multiple columns of a Dataframe
> *Output*:  Single new modified column
>

>>>
>>


Re: weightCol doesn't seem to be handled properly in PySpark

2016-09-12 Thread Evan Zamir
Yep, done. https://issues.apache.org/jira/browse/SPARK-17508

On Mon, Sep 12, 2016 at 9:06 AM Nick Pentreath 
wrote:

> Could you create a JIRA ticket for it?
>
> https://issues.apache.org/jira/browse/SPARK
>
> On Thu, 8 Sep 2016 at 07:50 evanzamir  wrote:
>
>> When I am trying to use LinearRegression, it seems that unless there is a
>> column specified with weights, it will raise a py4j error. Seems odd
>> because
>> supposedly the default is weightCol=None, but when I specifically pass in
>> weightCol=None to LinearRegression, I get this error.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/weightCol-doesn-t-seem-to-be-handled-properly-in-PySpark-tp27677.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark Java Heap Error

2016-09-12 Thread Baktaawar
Hi,

I even tried the dataframe.cache() action to carry out the cross tab
transformation. However, I still get the same OOM error.


recommender_ct.cache()
---
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 recommender_ct.cache()

/Users/i854319/spark/python/pyspark/sql/dataframe.pyc in cache(self)
375 """
376 self.is_cached = True
--> 377 self._jdf.cache()
378 return self
379 

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in
__call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814 
815 for temp_arg in temp_args:

/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling o40.cache.
: java.lang.OutOfMemoryError
at
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
at
org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryColumnarTableScan.scala:129)
at
org.apache.spark.sql.execution.columnar.InMemoryRelation.(InMemoryColumnarTableScan.scala:118)
at
org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryColumnarTableScan.scala:41)
at
org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:93)
at
org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:60)
at
org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:84)
at org.apache.spark.sql.DataFrame.persist(DataFrame.scala:1581)
at org.apache.spark.sql.DataFrame.cache(DataFrame.scala:1590)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at 

Partition n keys into exacly n partitions

2016-09-12 Thread sujeet jog
Hi,

Is there a way to partition a set of data with n keys into exactly n
partitions?

For example:

a tuple of 1008 rows with key x
a tuple of 1008 rows with key y, and so on, for a total of 10 keys (x, y, etc.)

Total records = 10080
NumOfKeys = 10

I want to partition the 10080 elements into exactly 10 partitions, with each
partition holding only elements that share a unique key.

Is there a way to make this happen? Any ideas on implementing a custom
partitioner?

The current partitioner I'm using is HashPartitioner, for which there are
cases where key.hashCode() % numPartitions becomes the same for keys x and y.

Hence many elements with different keys fall into a single partition at
times.



Thanks,
Sujeet


Re: Spark transformations

2016-09-12 Thread Thunder Stumpges
Hi Janardhan,

I have run into similar issues and asked similar questions. I also ran into
many problems with private code when trying to write my own
Model/Transformer/Estimator. (you might be able to find my question to the
group regarding this, I can't really tell if my emails are getting through,
as I don't get any responses). For now I have resorted to copying out the
code that I need from the spark codebase and into mine. I'm certain this is
not the best, but it has to be better than "implementing it myself" which
was what the only response to my question said to do.

As for the transforms, I also asked a similar question. The only way I've
seen it done in code is using a UDF. As you mention, the UDF can only
access fields on a "row by row" basis. I have not gotten any replies at all
on my question, but I also need to do some more complicated operation in my
work (join to another model RDD, flat-map, calculate, reduce) in order to
get the value for the new column. So far no great solution.
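
For reference, the row-by-row UDF approach looks roughly like this in PySpark
(illustrative only; the column names are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 3.0)], ["a", "b", "c"])

combine = F.udf(lambda a, b, c: a + b + c, DoubleType())
df.withColumn("combined", combine("a", "b", "c")).show()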

Sorry I don't have any answers, but wanted to chime in that I am also a bit
stuck on similar issues. Hope we can find a workable solution soon.
Cheers,
Thunder



On Tue, Sep 6, 2016 at 1:32 PM janardhan shetty 
wrote:

> Noticed few things about Spark transformers just wanted to be clear.
>
> Unary transformer:
>
> createTransformFunc: IN => OUT  = { *item* => }
> Here *item *is single element and *NOT* entire column.
>
> I would like to get the number of elements in that particular column.
> Since there is *no forward checking* how can we get this information ?
> We have visibility into single element and not the entire column.
>
>
>
>
>
>
>
>
>
>
> On Sun, Sep 4, 2016 at 9:30 AM, janardhan shetty 
> wrote:
>
>> In scala Spark ML Dataframes.
>>
>> On Sun, Sep 4, 2016 at 9:16 AM, Somasundaram Sekar <
>> somasundar.se...@tigeranalytics.com> wrote:
>>
>>> Can you try this
>>>
>>>
>>> https://www.linkedin.com/pulse/hive-functions-udfudaf-udtf-examples-gaurav-singh
>>>
>>> On 4 Sep 2016 9:38 pm, "janardhan shetty" 
>>> wrote:
>>>
 Hi,

 Is there any chance that we can send entire multiple columns to an udf
 and generate a new column for Spark ML.
 I see similar approach as VectorAssembler but not able to use few
 classes /traitslike HasInputCols, HasOutputCol, DefaultParamsWritable since
 they are private.

 Any leads/examples is appreciated in this regard..

 Requirement:
 *Input*: Multiple columns of a Dataframe
 *Output*:  Single new modified column

>>>
>>
>


Re: Debugging a spark application in a none lazy mode

2016-09-12 Thread Attias, Hagai
Hi,
Not sure what you mean, can you give an example?

Hagai.

From: Takeshi Yamamuro 
Date: Monday, September 12, 2016 at 7:24 PM
To: Hagai Attias 
Cc: "user@spark.apache.org" 
Subject: Re: Debugging a spark application in a none lazy mode

Hi,

Spark does not have such mode.
How about getting local arrays by `collect` methods for debugging?

// maropu

On Tue, Sep 13, 2016 at 12:44 AM, Hagai 
> wrote:
Hi guys,
Lately I was looking for a way to debug my spark application locally.

However, since all transformations are actually being executed when the
action is encountered, I have no way to look at the data after each
transformation. Does spark support a non-lazy mode which enables to execute
the transformations locally after each statement?

Thanks,
Hagai.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Debugging-a-spark-application-in-a-none-lazy-mode-tp27695.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org



--
---
Takeshi Yamamuro


Re: Getting figures from spark streaming

2016-09-12 Thread Thunder Stumpges
Just a guess, but doesn't the `.apply(0)` at the end of each of your print
statements take just the first element of the returned list?


On Wed, Sep 7, 2016 at 12:36 AM Ashok Kumar 
wrote:

> Any help on this warmly appreciated.
>
>
> On Tuesday, 6 September 2016, 21:31, Ashok Kumar
>  wrote:
>
>
> Hello Gurus,
>
> I am creating some figures and feed them into Kafka and then spark
> streaming.
>
> It works OK but I have the following issue.
>
> For now as a test I sent 5 prices in each batch interval. In the loop code
> this is what is hapening
>
>   dstream.foreachRDD { rdd =>
>  val x= rdd.count
>  i += 1
>  println(s"> rdd loop i is ${i}, number of lines is  ${x} <==")
>  if (x > 0) {
>println(s"processing ${x} records=")
>var words1 =
> rdd.map(_._2).map(_.split(',').view(0)).map(_.toInt).collect.apply(0)
> println (words1)
>var words2 =
> rdd.map(_._2).map(_.split(',').view(1)).map(_.toString).collect.apply(0)
> println (words2)
>var price =
> rdd.map(_._2).map(_.split(',').view(2)).map(_.toFloat).collect.apply(0)
> println (price)
> rdd.collect.foreach(println)
>}
>  }
>
> My tuple looks like this
>
> // (null, "ID   TIMESTAMP   PRICE")
> // (null, "40,20160426-080924,  67.55738301621814598514")
>
> And this the sample output from the run
>
> processing 5 records=
> 3
> 20160906-212509
> 80.224686
> (null,3,20160906-212509,80.22468448052631637099)
> (null,1,20160906-212509,60.40695324215582386153)
> (null,4,20160906-212509,61.95159400693415572125)
> (null,2,20160906-212509,93.05912099305473237788)
> (null,5,20160906-212509,81.08637370113427387121)
>
> Now it does process the first values 3, 20160906-212509, 80.224686  for
> record (null,3,20160906-212509,80.22468448052631637099)
> but ignores the rest. of 4 records. How can I make it go through all
> records here? I want the third column from all records!
>
> Greetings
>
>
>
>
>
>


Re: Debugging a spark application in a none lazy mode

2016-09-12 Thread Takeshi Yamamuro
Hi,

Spark does not have such mode.
How about getting local arrays by `collect` methods for debugging?

// maropu

On Tue, Sep 13, 2016 at 12:44 AM, Hagai  wrote:

> Hi guys,
> Lately I was looking for a way to debug my spark application locally.
>
> However, since all transformations are actually being executed when the
> action is encountered, I have no way to look at the data after each
> transformation. Does spark support a non-lazy mode which enables to execute
> the transformations locally after each statement?
>
> Thanks,
> Hagai.
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Debugging-a-spark-application-
> in-a-none-lazy-mode-tp27695.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
---
Takeshi Yamamuro


Re: weightCol doesn't seem to be handled properly in PySpark

2016-09-12 Thread Nick Pentreath
Could you create a JIRA ticket for it?

https://issues.apache.org/jira/browse/SPARK

On Thu, 8 Sep 2016 at 07:50 evanzamir  wrote:

> When I am trying to use LinearRegression, it seems that unless there is a
> column specified with weights, it will raise a py4j error. Seems odd
> because
> supposedly the default is weightCol=None, but when I specifically pass in
> weightCol=None to LinearRegression, I get this error.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/weightCol-doesn-t-seem-to-be-handled-properly-in-PySpark-tp27677.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Debugging a spark application in a none lazy mode

2016-09-12 Thread Hagai
Hi guys,
Lately I was looking for a way to debug my spark application locally.

However, since all transformations are actually being executed when the
action is encountered, I have no way to look at the data after each
transformation. Does spark support a non-lazy mode which enables to execute
the transformations locally after each statement?

Thanks,
Hagai.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Debugging-a-spark-application-in-a-none-lazy-mode-tp27695.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Re: Selecting the top 100 records per group by?

2016-09-12 Thread Mich Talebzadeh
Hi,

I don't understand why you need to add a row_number column when you can use
rank or dense_rank.

Why can one not use rank or dense_rank here?
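
For comparison, a quick illustration with made-up data: with ties, dense_rank()
can keep more than N rows per group, while row_number() always returns at most
N rows per group.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 10), ("a", 10), ("a", 9), ("b", 7), ("b", 6)], ["grp", "price"])

w = Window.partitionBy("grp").orderBy(F.desc("price"))
df.withColumn("rn", F.row_number().over(w)).filter("rn <= 1").show()   # one row per group
df.withColumn("dr", F.dense_rank().over(w)).filter("dr <= 1").show()   # ties kept: two rows for 'a'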

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 September 2016 at 15:37,  wrote:

> hi kevin
> window function is what you need, like below:
> val hivetable = hc.sql("select * from house_sale_pv_location")
> val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
> val sortedDF = hivetable.withColumn("rowNumber", row_number().over(
> overLocation)).filter("rowNumber<=50")
>
> here I add a column as rownumber,  get all data partitioned and get the
> top 50 rows.
>
>
>
> 
>
> ThanksBest regards!
> San.Luo
>
> - Original Message -
> From: Mich Talebzadeh 
> To: "user @spark" 
> Subject: Re: Selecting the top 100 records per group by?
> Date: 2016-09-11 22:20
>
> You can of course do this using FP.
>
> val wSpec = Window.partitionBy('price).orderBy(desc("price"))
> df2.filter('security > " 
> ").select(dense_rank().over(wSpec).as("rank"),'TIMECREATED,
> 'SECURITY, substring('PRICE,1,7)).filter('rank<=10).show
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 September 2016 at 07:15, Mich Talebzadeh 
> wrote:
>
> DENSE_RANK will give you ordering and sequence within a particular column.
> This is Hive
>
>  var sqltext = """
>  | SELECT RANK, timecreated,security, price
>  |  FROM (
>  |SELECT timecreated,security, price,
>  |   DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
>  |  FROM test.prices
>  |   ) tmp
>  |  WHERE rank <= 10
>  | """
> sql(sqltext).collect.foreach(println)
>
> [1,2016-09-09 16:55:44,Esso,99.995]
> [1,2016-09-09 21:22:52,AVIVA,99.995]
> [1,2016-09-09 21:22:52,Barclays,99.995]
> [1,2016-09-09 21:24:28,JPM,99.995]
> [1,2016-09-09 21:30:38,Microsoft,99.995]
> [1,2016-09-09 21:31:12,UNILEVER,99.995]
> [2,2016-09-09 16:54:14,BP,99.99]
> [2,2016-09-09 16:54:36,Tate & Lyle,99.99]
> [2,2016-09-09 16:56:28,EASYJET,99.99]
> [2,2016-09-09 16:59:28,IBM,99.99]
> [2,2016-09-09 20:16:10,EXPERIAN,99.99]
> [2,2016-09-09 22:25:20,Microsoft,99.99]
> [2,2016-09-09 22:53:49,Tate & Lyle,99.99]
> [3,2016-09-09 15:31:06,UNILEVER,99.985]
>
> HTH
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 September 2016 at 04:32, Kevin Burton  wrote:
>
> Looks like you can do it with dense_rank functions.
>
> https://databricks.com/blog/2015/07/15/introducing-window-fu
> nctions-in-spark-sql.html
>
> I setup some basic records and seems like it did the right thing.
>
> Now time to throw 50TB and 100 spark nodes at this problem and see what
> happens :)
>
> On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton  wrote:
>
> Ah.. might actually. I'll have to mess around with that.
>
> On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley  wrote:
>
> Would `topByKey` help?
>
> https://github.com/apache/spark/blob/master/mllib/src/main/s
> cala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
>
> Best,
> Karl
>
> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton  wrote:
>
> I'm trying to figure out a way to group by and return the top 100 records
> 

Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
Hi Ayan,

"My problem is to get data on to HDFS for the first time."

Well, you have to put them on the cluster. With this simple command you can
load them into HDFS:

hdfs dfs -put $LOCAL_SRC_DIR $HDFS_PATH

Then, I think you have to use coalesce in order to consolidate them into a
few larger files :) but I have not had to do it myself yet, so maybe I am
wrong.
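
A rough sketch of that route (paths are placeholders; `sc` is the
SparkContext): once the raw files are on HDFS, Spark can pack them into a
handful of sequence files in parallel.

raw = sc.wholeTextFiles("hdfs:///landing/small_files/*", minPartitions=200)  # (path, content) pairs
raw.coalesce(32).saveAsSequenceFile("hdfs:///consolidated/seq")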

Please take a look at these posts and let us know how you deal with it.

https://stuartsierra.com/2008/04/24/a-million-little-files


http://blog.cloudera.com/blog/2009/02/the-small-files-problem/


"One way I can think of is to load small files on each cluster machines on
the same folder. For example load file 1-0.3 mil on machine 1, 0.3-0.6 mil
on machine 2 and so on. Then I can run spark jobs which will locally read
files. "


Well, Hadoop does not work that way. When you load data into a Hadoop
cluster, the data is distributed across every machine belonging to your
cluster, and the files are split between machines. I think you are talking
about data locality, aren't you?

"Any better solution? Can flume help here?"

Of course Flume can do the job, but you will still have the small-files
problem anyway. You have to create a consolidated file before you upload it
to HDFS.

Regards




Alonso Isidoro Roman
about.me/alonso.isidoro.roman


2016-09-12 14:11 GMT+02:00 ayan guha :

> Hi
>
> Thanks for your mail. I have read few of those posts. But always I see
> solutions assume data is on hdfs already. My problem is to get data on to
> HDFS for the first time.
>
> One way I can think of is to load small files on each cluster machines on
> the same folder. For example load file 1-0.3 mil on machine 1, 0.3-0.6 mil
> on machine 2 and so on. Then I can run spark jobs which will locally read
> files.
>
> Any better solution? Can flume help here?
>
> Any idea is appreciated.
>
> Best
> Ayan
> On 12 Sep 2016 20:54, "Alonso Isidoro Roman"  wrote:
>
>> That is a good question Ayan. A few searches on so returns me:
>>
>> http://stackoverflow.com/questions/31009834/merge-multiple-
>> small-files-in-to-few-larger-files-in-spark
>>
>> http://stackoverflow.com/questions/29025147/how-can-i-merge-
>> spark-results-files-without-repartition-and-copymerge
>>
>>
>> good luck, tell us something about this issue
>>
>> Alonso
>>
>>
>> Alonso Isidoro Roman
>> about.me/alonso.isidoro.roman
>>
>> 
>>
>> 2016-09-12 12:39 GMT+02:00 ayan guha :
>>
>>> Hi
>>>
>>> I have a general question: I have 1.6 mil small files, about 200G all
>>> put together. I want to put them on hdfs for spark processing.
>>> I know sequence file is the way to go because putting small files on
>>> hdfs is not correct practice. Also, I can write a code to consolidate small
>>> files to seq files locally.
>>> My question: is there any way to do this in parallel, for example using
>>> spark or mr or anything else.
>>>
>>> Thanks
>>> Ayan
>>>
>>
>>


Reply: Re: Selecting the top 100 records per group by?

2016-09-12 Thread luohui20001
hi kevin

window function is what you need, like below:

val hivetable = hc.sql("select * from house_sale_pv_location")
val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
val sortedDF = hivetable.withColumn("rowNumber", row_number().over(overLocation)).filter("rowNumber<=50")

here I add a column as rowNumber, get all data partitioned and get the top 50
rows.


Thanks
Best regards!
San.Luo

- Original Message -
From: Mich Talebzadeh 
To: "user @spark" 
Subject: Re: Selecting the top 100 records per group by?
Date: 2016-09-11 22:20

You can of course do this using FP.

val wSpec = Window.partitionBy('price).orderBy(desc("price"))
df2.filter('security > " ").select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY, substring('PRICE,1,7)).filter('rank<=10).show

HTH



Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction
of data or any other property which may arise from relying on this email's 
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  



On 11 September 2016 at 07:15, Mich Talebzadeh  
wrote:
DENSE_RANK will give you ordering and sequence within a particular column. This
is Hive:

 var sqltext = """
 | SELECT RANK, timecreated,security, price
 |  FROM (
 |SELECT timecreated,security, price,
 |   DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
 |  FROM test.prices
 |   ) tmp
 |  WHERE rank <= 10
 | """
sql(sqltext).collect.foreach(println)

[1,2016-09-09 16:55:44,Esso,99.995]
[1,2016-09-09 21:22:52,AVIVA,99.995]
[1,2016-09-09 21:22:52,Barclays,99.995]
[1,2016-09-09 21:24:28,JPM,99.995]
[1,2016-09-09 21:30:38,Microsoft,99.995]
[1,2016-09-09 21:31:12,UNILEVER,99.995]
[2,2016-09-09 16:54:14,BP,99.99]
[2,2016-09-09 16:54:36,Tate & Lyle,99.99]
[2,2016-09-09 16:56:28,EASYJET,99.99]
[2,2016-09-09 16:59:28,IBM,99.99]
[2,2016-09-09 20:16:10,EXPERIAN,99.99]
[2,2016-09-09 22:25:20,Microsoft,99.99]
[2,2016-09-09 22:53:49,Tate & Lyle,99.99]
[3,2016-09-09 15:31:06,UNILEVER,99.985]

HTH







Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction
of data or any other property which may arise from relying on this email's 
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  



On 11 September 2016 at 04:32, Kevin Burton  wrote:
Looks like you can do it with dense_rank functions.
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

I setup some basic records and seems like it did the right thing.
Now time to throw 50TB and 100 spark nodes at this problem and see what happens 
:)
On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton  wrote:
Ah.. might actually. I'll have to mess around with that.
On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley  wrote:
Would `topByKey` help?

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42

Best,Karl
On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton  wrote:
I'm trying to figure out a way to group by and return the top 100 records in 
that group.
Something like:
SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
But I can't really figure out the best way to do this... 
There is a FIRST and LAST aggregate function but this only returns one column.
I could do something like:
SELECT * FROM posts WHERE user_id IN ( /* select top users here */ ) LIMIT 100;
But that limit is applied for ALL the records. Not each individual user.  
The only other thing I can think of is to do a manual map reduce and then have 
the reducer only return the top 100 each time... 
Would LOVE some advice here... 
-- 
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com… or check out my Google+ profile





RE: questions about using dapply

2016-09-12 Thread xingye
Hi, Felix,
Thanks for the information. As in my previous email, I've made MARGIN
capitalized and it worked with dapplyCollect, but it does not work in dapply.

If I use dapply and put the original apply function as a function for dapply:

cols_in <- dapply(df, function(x) {apply(x[, paste("cr_cd", 1:12, sep = "")], Margin=2, function(y){ y %in% c(61, 99)})}, schema)

the error shows: Error in match.fun(FUN) : argument "FUN" is missing, with no default

From: felixcheun...@hotmail.com
To: user@spark.apache.org; tracy.up...@gmail.com
Subject: Re: questions about using dapply
Date: Sun, 11 Sep 2016 01:52:37 +

You might need MARGIN capitalized, this example works though:

c <- as.DataFrame(cars)
# rename the columns to c1, c2
c <- selectExpr(c, "speed as c1", "dist as c2")
cols_in <- dapplyCollect(c,
function(x) {apply(x[, paste("c", 1:2, sep = "")], MARGIN=2, FUN = function(y){ y %in% c(61, 99)})})
# dapplyCollect does not require the schema parameter

_

From: xingye 

Sent: Friday, September 9, 2016 10:35 AM

Subject: questions about using dapply

To: 









I have a question about using UDF in SparkR. I'm converting some R code into
SparkR.

• The original R code is:

cols_in <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", c(61, 99))

• If I use dapply and put the original apply function as a function for dapply:

cols_in <- dapply(df,
    function(x) {apply(x[, paste("cr_cd", 1:12, sep = "")], Margin=2, function(y){ y %in% c(61, 99)})},
    schema)

The error shows: Error in match.fun(FUN) : argument "FUN" is missing, with no default

• If I use spark.lapply, it still shows the error. It seems that in Spark, the
column cr_cd1 is ambiguous.

cols_in <- spark.lapply(df[, paste("cr_cd", 1:12, sep = "")], function(x){ x %in% c(61, 99)})

16/09/08 ERROR RBackendHandler: select on 3101 failed Error in
invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Reference 'cr_cd1' is ambiguous, could
be: cr_cd1#2169L, cr_cd1#17787L.;

If I use dapplyCollect, it works, but it will lead to memory issues if the data
is big. How can dapply work in my case?

wrapper = function(df){
    out = apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", c(61, 99))
    return(out)
}

cols_in <- dapplyCollect(df, wrapper)





  

Re: Spark tasks block randomly on standalone cluster

2016-09-12 Thread Denis Bolshakov
Hello,

I see such behavior from time to time.

Similar problem is described here:
http://apache-spark-user-list.1001560.n3.nabble.com/Executor-Memory-Task-hangs-td12377.html

We also use speculative execution as a workaround (our Spark version is 1.6.0).
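
For reference, a minimal sketch of enabling it (standard Spark configuration
keys; the values here are just examples):

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.speculation", "true")
        .set("spark.speculation.interval", "1000ms")
        .set("spark.speculation.multiplier", "2"))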

But I would like to share one of observations.
We have two jenkins, one uses java 7 and another java 8.

And sometimes I see the issue during integration testing on jenkins with
java 7 (and never on java 8)

So I really hope that the issue will disappear after we complete our java
migration.

Which java version do you use?

Best regards,
Denis

On 12 September 2016 at 15:31, bogdanbaraila 
wrote:

> We have a quite complex application that runs on Spark Standalone.
> In some cases the tasks from one of the workers block randomly for an
> infinite amount of time in the RUNNING state.
> [screenshot: SparkStandaloneIssue.png]
>
>
> Extra info:
> - there aren't any errors in the logs
> - ran with the logger in debug and I didn't see any relevant messages (I see
> when the task starts, but then there is no activity for it)
> - the jobs work fine if I have only 1 worker
> - the same job may execute a second time without any issues, in a proper
> amount of time
> - I don't have any really big partitions that could cause delays for some
> of the tasks
> - in Spark 2.0 I've moved from RDD to Datasets and I have the same issue
> - in Spark 1.4 I was able to overcome the issue by turning on speculation,
> but in Spark 2.0 the blocking tasks are from different workers (while in 1.4
> I had blocking tasks on only 1 worker), so speculation isn't fixing my
> issue
> - I have the issue in more environments, so I don't think it's hardware
> related.
>
> Has anyone experienced something similar? Any suggestions on how I could
> identify the issue?
>
> Thanks a lot!
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-tasks-blockes-randomly-on-
> standalone-cluster-tp27693.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.de...@gmail.com


Re: Spark metrics when running with YARN?

2016-09-12 Thread Vladimir Tretyakov
Hello Saisai Shao, Jacek Laskowski, thanks for the information.

We are working on a Spark monitoring tool and our users have different setup
modes (Standalone, Mesos, YARN).

We looked at the code and found:

/**
 * Attempt to start a Jetty server bound to the supplied hostName:port
 * using the given context handlers.
 *
 * If the desired port number is contended, continues incrementing ports
 * until a free port is found. Return the jetty Server object, the chosen
 * port, and a mutable collection of handlers.
 */

It seems the most generic way (which will work for most users) will be to start
looking at ports:

spark.ui.port (4040 by default)
spark.ui.port + 1
spark.ui.port + 2
spark.ui.port + 3
...

until we get responses from Spark.
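
A rough sketch of that probing loop, just to illustrate the idea (it assumes
the driver runs on localhost, scans only ten ports, and skips timeouts and
JSON parsing, which a real agent would need):

import scala.io.Source
import scala.util.Try

object UiPortProbe {
  def main(args: Array[String]): Unit = {
    val basePort = 4040                         // spark.ui.port default
    val candidates = basePort until basePort + 10

    // Keep only the ports where the application REST API answers.
    val live = candidates.flatMap { port =>
      val url = s"http://localhost:$port/api/v1/applications"
      Try(Source.fromURL(url).mkString).toOption.map(json => port -> json)
    }

    live.foreach { case (port, json) =>
      println(s"Spark UI found on port $port: $json")
    }
  }
}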

PS: yes, there may be some collisions with other applications in some setups;
in that case we can ask users about these exceptions and work around them on
our side.

Best regards, Vladimir.

On Mon, Sep 12, 2016 at 12:07 PM, Saisai Shao 
wrote:

> Here is the yarn RM REST API for you to refer (http://hadoop.apache.org/
> docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html). You
> can use these APIs to query applications running on yarn.
>
> On Sun, Sep 11, 2016 at 11:25 PM, Jacek Laskowski  wrote:
>
>> Hi Vladimir,
>>
>> You'd have to talk to your cluster manager to query for all the
>> running Spark applications. I'm pretty sure YARN and Mesos can do that
>> but unsure about Spark Standalone. This is certainly not something a
>> Spark application's web UI could do for you since it is designed to
>> handle the single Spark application.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sun, Sep 11, 2016 at 11:18 AM, Vladimir Tretyakov
>>  wrote:
>> > Hello Jacek, thx a lot, it works.
>> >
>> > Is there a way how to get list of running applications from REST API?
>> Or I
>> > have to try connect 4040 4041... 40xx ports and check if ports answer
>> > something?
>> >
>> > Best regards, Vladimir.
>> >
>> > On Sat, Sep 10, 2016 at 6:00 AM, Jacek Laskowski 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> That's correct. One app one web UI. Open 4041 and you'll see the other
>> >> app.
>> >>
>> >> Jacek
>> >>
>> >>
>> >> On 9 Sep 2016 11:53 a.m., "Vladimir Tretyakov"
>> >>  wrote:
>> >>>
>> >>> Hello again.
>> >>>
>> >>> I am trying to play with Spark version "2.11-2.0.0".
>> >>>
>> >>> Problem that REST API and UI shows me different things.
>> >>>
>> >>> I've stared 2 applications from "examples set": opened 2 consoles and
>> run
>> >>> following command in each:
>> >>>
>> >>> ./bin/spark-submit   --class org.apache.spark.examples.SparkPi
>>  --master
>> >>> spark://wawanawna:7077  --executor-memory 2G  --total-executor-cores
>> 30
>> >>> examples/jars/spark-examples_2.11-2.0.0.jar  1
>> >>>
>> >>> Request to API endpoint:
>> >>>
>> >>> http://localhost:4040/api/v1/applications
>> >>>
>> >>> returned me following JSON:
>> >>>
>> >>> [ {
>> >>>   "id" : "app-20160909184529-0016",
>> >>>   "name" : "Spark Pi",
>> >>>   "attempts" : [ {
>> >>> "startTime" : "2016-09-09T15:45:25.047GMT",
>> >>> "endTime" : "1969-12-31T23:59:59.999GMT",
>> >>> "lastUpdated" : "2016-09-09T15:45:25.047GMT",
>> >>> "duration" : 0,
>> >>> "sparkUser" : "",
>> >>> "completed" : false,
>> >>> "startTimeEpoch" : 1473435925047,
>> >>> "endTimeEpoch" : -1,
>> >>> "lastUpdatedEpoch" : 1473435925047
>> >>>   } ]
>> >>> } ]
>> >>>
>> >>> so response contains information only about 1 application.
>> >>>
>> >>> But in reality I've started 2 applications and Spark UI shows me 2
>> >>> RUNNING application (please see screenshot).
>> >>>
>> >>> Does anybody maybe know answer why API and UI shows different things?
>> >>>
>> >>>
>> >>> Best regards, Vladimir.
>> >>>
>> >>>
>> >>> On Tue, Aug 30, 2016 at 3:52 PM, Vijay Kiran 
>> wrote:
>> 
>>  Hi Otis,
>> 
>>  Did you check the REST API as documented in
>>  http://spark.apache.org/docs/latest/monitoring.html
>> 
>>  Regards,
>>  Vijay
>> 
>>  > On 30 Aug 2016, at 14:43, Otis Gospodnetić
>>  >  wrote:
>>  >
>>  > Hi Mich and Vijay,
>>  >
>>  > Thanks!  I forgot to include an important bit - I'm looking for a
>>  > programmatic way to get Spark metrics when running Spark under
>> YARN - so JMX
>>  > or API of some kind.
>>  >
>>  > Thanks,
>>  > Otis
>>  > --
>>  > Monitoring - Log Management - Alerting - Anomaly Detection
>>  > Solr & Elasticsearch Consulting Support Training -
>>  > http://sematext.com/
>>  >
>>  >
>>  > On Tue, Aug 30, 2016 at 6:59 AM, Mich Talebzadeh
>>  > 

Spark tasks block randomly on standalone cluster

2016-09-12 Thread bogdanbaraila
We have a quite complex application that runs on Spark Standalone.
In some cases the tasks from one of the workers block randomly for an
infinite amount of time in the RUNNING state.

[screenshot: SparkStandaloneIssue.png]

Extra info:
- there aren't any errors in the logs
- ran with the logger in debug and I didn't see any relevant messages (I see
when the task starts, but then there is no activity for it)
- the jobs work fine if I have only 1 worker
- the same job may execute a second time without any issues, in a proper
amount of time
- I don't have any really big partitions that could cause delays for some
of the tasks
- in Spark 2.0 I've moved from RDD to Datasets and I have the same issue
- in Spark 1.4 I was able to overcome the issue by turning on speculation,
but in Spark 2.0 the blocking tasks are from different workers (while in 1.4
I had blocking tasks on only 1 worker), so speculation isn't fixing my
issue
- I have the issue in more environments, so I don't think it's hardware
related.

Has anyone experienced something similar? Any suggestions on how I could
identify the issue?

Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-tasks-blockes-randomly-on-standalone-cluster-tp27693.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Small files

2016-09-12 Thread ayan guha
Hi

Thanks for your mail. I have read a few of those posts, but the solutions
always assume the data is already on HDFS. My problem is getting the data
onto HDFS in the first place.

One way I can think of is to load the small files onto each cluster machine
under the same folder. For example, load files 1-0.3 million on machine 1,
0.3-0.6 million on machine 2, and so on. Then I can run Spark jobs which will
read the files locally.
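
For illustration, a rough sketch of the consolidation step in Spark, assuming
the source folder is readable from every executor (for example a shared or
NFS mount; the paths below are made up) and the files are plain text:

import org.apache.spark.{SparkConf, SparkContext}

object ConsolidateSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("consolidate-small-files"))

    // wholeTextFiles yields one (path, content) record per small file.
    sc.wholeTextFiles("file:///data/small-files", minPartitions = 400)
      .repartition(200)   // roughly 1 GB per output file for ~200 GB of input
      .saveAsSequenceFile("hdfs:///data/consolidated-seq")

    sc.stop()
  }
}

If the files are binary rather than text, sc.binaryFiles is the analogous
entry point.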

Is there any better solution? Can Flume help here?

Any idea is appreciated.

Best
Ayan
On 12 Sep 2016 20:54, "Alonso Isidoro Roman"  wrote:

> That is a good question, Ayan. A few searches on Stack Overflow return:
>
> http://stackoverflow.com/questions/31009834/merge-
> multiple-small-files-in-to-few-larger-files-in-spark
>
> http://stackoverflow.com/questions/29025147/how-can-i-
> merge-spark-results-files-without-repartition-and-copymerge
>
>
> good luck, tell us something about this issue
>
> Alonso
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2016-09-12 12:39 GMT+02:00 ayan guha :
>
>> Hi
>>
>> I have a general question: I have 1.6 mil small files, about 200G all put
>> together. I want to put them on hdfs for spark processing.
>> I know sequence file is the way to go because putting small files on hdfs
>> is not correct practice. Also, I can write a code to consolidate small
>> files to seq files locally.
>> My question: is there any way to do this in parallel, for example using
>> spark or mr or anything else.
>>
>> Thanks
>> Ayan
>>
>
>


Re: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-12 Thread Daniel Lopes
Thanks Steve,

But this error occurs only with Parquet files; CSVs work.

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br
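
For reference, a rough sketch of Steve's suggestion quoted below: set the
swift options on the Spark conf before the context is created, so they reach
the Hadoop configuration via the spark.hadoop. prefix. Scala shown; the option
names come from the snippet in the quoted thread and the values are
placeholders.

import org.apache.spark.{SparkConf, SparkContext}

// Every "spark.hadoop.*" key is copied into the Hadoop Configuration at startup,
// so the swift filesystem sees it before the first read.
val conf = new SparkConf()
  .setAppName("swift-parquet")
  .set("spark.hadoop.fs.swift.service.keystone.auth.url",
       "https://identity.open.softlayer.com/v3/auth/tokens")
  .set("spark.hadoop.fs.swift.service.keystone.auth.endpoint.prefix", "endpoints")
  .set("spark.hadoop.fs.swift.service.keystone.tenant", "<projectId>")
  .set("spark.hadoop.fs.swift.service.keystone.username", "<userId>")
  .set("spark.hadoop.fs.swift.service.keystone.password", "<password>")
  .set("spark.hadoop.fs.swift.service.keystone.http.port", "8080")
  .set("spark.hadoop.fs.swift.service.keystone.region", "dallas")
  .set("spark.hadoop.fs.swift.service.keystone.public", "true")

val sc = new SparkContext(conf)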


On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran 
wrote:

>
> On 9 Sep 2016, at 17:56, Daniel Lopes  wrote:
>
> Hi, someone can help
>
> I'm trying to use parquet in IBM Block Storage at Spark but when I try to
> load get this error:
>
> using this config
>
> credentials = {
>   "name": "keystone",
>   *"auth_url": "https://identity.open.softlayer.com
> ",*
>   "project": "object_storage_23f274c1_d11XXXe634",
>   "projectId": "XXd9c4aa39b7c7eb",
>   "region": "dallas",
>   "userId": "X64087180b40X2b909",
>   "username": "admin_9dd810f8901d48778XX",
>   "password": "chX6_",
>   "domainId": "c1ddad17cfcX41",
>   "domainName": "10XX",
>   "role": "admin"
> }
>
> def set_hadoop_config(credentials):
> """This function sets the Hadoop configuration with given credentials,
> so it is possible to access data using SparkContext"""
>
> prefix = "fs.swift.service." + credentials['name']
> hconf = sc._jsc.hadoopConfiguration()
> *hconf.set(prefix + ".auth.url",
> credentials['auth_url']+'/v3/auth/tokens')*
> hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
> hconf.set(prefix + ".tenant", credentials['projectId'])
> hconf.set(prefix + ".username", credentials['userId'])
> hconf.set(prefix + ".password", credentials['password'])
> hconf.setInt(prefix + ".http.port", 8080)
> hconf.set(prefix + ".region", credentials['region'])
> hconf.setBoolean(prefix + ".public", True)
>
> set_hadoop_config(credentials)
>
> -
>
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 train.groupby('Acordo').count().show()
>
> *Py4JJavaError: An error occurred while calling* o406.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 60 in stage 30.0 failed 10 times, most recent failure: Lost task 60.9 in
> stage 30.0 (TID 2556, yp-spark-dal09-env5-0039): org.apache.hadoop.fs.swift.
> exceptions.SwiftConfigurationException:* Missing mandatory configuration
> option: fs.swift.service.keystone.auth.url*
>
>
>
> In my own code, I'd assume that the value of credentials['name'] didn't
> match that of the URL, assuming you have something like
> swift://bucket.keystone . Failing that: the options were set too late.
>
> Instead of asking for the hadoop config and editing that, set the option
> in your spark context, before it is launched, with the prefix "hadoop"
>
>
> at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(
> RestClientBindings.java:223)
> at org.apache.hadoop.fs.swift.http.RestClientBindings.bind(
> RestClientBindings.java:147)
>
>
> *Daniel Lopes*
> Chief Data and Analytics Officer | OneMatch
> c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
>
> www.onematch.com.br
> 
>
>
>


Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
That is a good question, Ayan. A few searches on Stack Overflow return:

http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark

http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge


good luck, tell us something about this issue

Alonso


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman


2016-09-12 12:39 GMT+02:00 ayan guha :

> Hi
>
> I have a general question: I have 1.6 mil small files, about 200G all put
> together. I want to put them on hdfs for spark processing.
> I know sequence file is the way to go because putting small files on hdfs
> is not correct practice. Also, I can write a code to consolidate small
> files to seq files locally.
> My question: is there any way to do this in parallel, for example using
> spark or mr or anything else.
>
> Thanks
> Ayan
>


Small files

2016-09-12 Thread ayan guha
Hi

I have a general question: I have 1.6 million small files, about 200 GB all
put together. I want to put them on HDFS for Spark processing.
I know sequence files are the way to go, because putting lots of small files
on HDFS is not good practice. Also, I can write code to consolidate the small
files into sequence files locally.
My question: is there any way to do this in parallel, for example using
Spark, MapReduce, or anything else?

Thanks
Ayan


Unsubscribe

2016-09-12 Thread bijuna

> 
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark word count program , need help on integration

2016-09-12 Thread gobi s
Hi,
  I am new to Spark.
  I want to develop a word count app and deploy it in local mode. From
outside, I want to trigger the program, get the word count output, and
show it in the UI.

I need help integrating Spark with an outside application:
  i) How do I trigger the Spark app from the J2EE app?
  ii) How do I collect the result and give it back to the caller app?

Kindly let me know if you need any clarification on this.

I have searched many sites for the answer; unfortunately, I couldn't
find it.
It would be great if you could forward any link which talks about this.
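
For (i), Spark ships a programmatic launcher, org.apache.spark.launcher.SparkLauncher
(in the spark-launcher artifact), which can be called from any JVM application
such as a J2EE backend. A rough sketch; the Spark home, jar path, main class
and arguments below are made up:

import org.apache.spark.launcher.SparkLauncher

object WordCountTrigger {
  def main(args: Array[String]): Unit = {
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                           // local Spark installation
      .setAppResource("/path/to/wordcount.jar")             // your packaged word count app
      .setMainClass("com.example.WordCount")
      .setMaster("local[*]")
      .setAppName("word-count")
      .addAppArgs("/path/to/input.txt", "/path/to/output")
      .launch()                                             // returns a java.lang.Process

    val exitCode = process.waitFor()
    println(s"word count job finished with exit code $exitCode")
  }
}

For (ii), a common pattern is to have the Spark job write its result to HDFS,
a database or some other shared store and let the calling app read it from
there, since launch() only gives back the process and its exit code.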



-- 

\ Gobi.S '


Re: Spark metrics when running with YARN?

2016-09-12 Thread Saisai Shao
Here is the YARN RM REST API for your reference (
http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html).
You can use these APIs to query applications running on YARN.
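
For example, a minimal probe of that endpoint (the host rm-host is made up,
8088 is the default RM web port, and JSON parsing is left out):

import scala.io.Source

object YarnSparkApps {
  def main(args: Array[String]): Unit = {
    // /ws/v1/cluster/apps is the RM endpoint documented at the link above;
    // the query parameters restrict the answer to running Spark applications.
    val url = "http://rm-host:8088/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK"
    println(Source.fromURL(url).mkString)
  }
}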

On Sun, Sep 11, 2016 at 11:25 PM, Jacek Laskowski  wrote:

> Hi Vladimir,
>
> You'd have to talk to your cluster manager to query for all the
> running Spark applications. I'm pretty sure YARN and Mesos can do that
> but unsure about Spark Standalone. This is certainly not something a
> Spark application's web UI could do for you since it is designed to
> handle the single Spark application.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Sep 11, 2016 at 11:18 AM, Vladimir Tretyakov
>  wrote:
> > Hello Jacek, thx a lot, it works.
> >
> > Is there a way how to get list of running applications from REST API? Or
> I
> > have to try connect 4040 4041... 40xx ports and check if ports answer
> > something?
> >
> > Best regards, Vladimir.
> >
> > On Sat, Sep 10, 2016 at 6:00 AM, Jacek Laskowski 
> wrote:
> >>
> >> Hi,
> >>
> >> That's correct. One app one web UI. Open 4041 and you'll see the other
> >> app.
> >>
> >> Jacek
> >>
> >>
> >> On 9 Sep 2016 11:53 a.m., "Vladimir Tretyakov"
> >>  wrote:
> >>>
> >>> Hello again.
> >>>
> >>> I am trying to play with Spark version "2.11-2.0.0".
> >>>
> >>> Problem that REST API and UI shows me different things.
> >>>
> >>> I've stared 2 applications from "examples set": opened 2 consoles and
> run
> >>> following command in each:
> >>>
> >>> ./bin/spark-submit   --class org.apache.spark.examples.SparkPi
>  --master
> >>> spark://wawanawna:7077  --executor-memory 2G  --total-executor-cores 30
> >>> examples/jars/spark-examples_2.11-2.0.0.jar  1
> >>>
> >>> Request to API endpoint:
> >>>
> >>> http://localhost:4040/api/v1/applications
> >>>
> >>> returned me following JSON:
> >>>
> >>> [ {
> >>>   "id" : "app-20160909184529-0016",
> >>>   "name" : "Spark Pi",
> >>>   "attempts" : [ {
> >>> "startTime" : "2016-09-09T15:45:25.047GMT",
> >>> "endTime" : "1969-12-31T23:59:59.999GMT",
> >>> "lastUpdated" : "2016-09-09T15:45:25.047GMT",
> >>> "duration" : 0,
> >>> "sparkUser" : "",
> >>> "completed" : false,
> >>> "startTimeEpoch" : 1473435925047,
> >>> "endTimeEpoch" : -1,
> >>> "lastUpdatedEpoch" : 1473435925047
> >>>   } ]
> >>> } ]
> >>>
> >>> so response contains information only about 1 application.
> >>>
> >>> But in reality I've started 2 applications and Spark UI shows me 2
> >>> RUNNING application (please see screenshot).
> >>>
> >>> Does anybody maybe know answer why API and UI shows different things?
> >>>
> >>>
> >>> Best regards, Vladimir.
> >>>
> >>>
> >>> On Tue, Aug 30, 2016 at 3:52 PM, Vijay Kiran 
> wrote:
> 
>  Hi Otis,
> 
>  Did you check the REST API as documented in
>  http://spark.apache.org/docs/latest/monitoring.html
> 
>  Regards,
>  Vijay
> 
>  > On 30 Aug 2016, at 14:43, Otis Gospodnetić
>  >  wrote:
>  >
>  > Hi Mich and Vijay,
>  >
>  > Thanks!  I forgot to include an important bit - I'm looking for a
>  > programmatic way to get Spark metrics when running Spark under YARN
> - so JMX
>  > or API of some kind.
>  >
>  > Thanks,
>  > Otis
>  > --
>  > Monitoring - Log Management - Alerting - Anomaly Detection
>  > Solr & Elasticsearch Consulting Support Training -
>  > http://sematext.com/
>  >
>  >
>  > On Tue, Aug 30, 2016 at 6:59 AM, Mich Talebzadeh
>  >  wrote:
>  > Spark UI regardless of deployment mode Standalone, yarn etc runs on
>  > port 4040 by default that can be accessed directly
>  >
>  > Otherwise one can specify a specific port with --conf
>  > "spark.ui.port=5" for example 5
>  >
>  > HTH
>  >
>  > Dr Mich Talebzadeh
>  >
>  > LinkedIn
>  > https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  >
>  > http://talebzadehmich.wordpress.com
>  >
>  > Disclaimer: Use it at your own risk. Any and all responsibility for
>  > any loss, damage or destruction of data or any other property which
> may
>  > arise from relying on this email's technical content is explicitly
>  > disclaimed. The author will in no case be liable for any monetary
> damages
>  > arising from such loss, damage or destruction.
>  >
>  >
>  > On 30 August 2016 at 11:48, Vijay Kiran 
> wrote:
>  >
>  > From Yarm RM UI, find the spark application Id, and in the
> application
>  > details, you can click on the “Tracking URL” which 

Re: Using Zeppelin with Spark FP

2016-09-12 Thread Sachin Janani
Yes, Zeppelin 0.6.1 works properly with Spark 2.0.

On Mon, Sep 12, 2016 at 1:10 PM, Mich Talebzadeh 
wrote:

> Does Zeppelin work OK with Spark 2?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 12 September 2016 at 08:24, Sachin Janani 
> wrote:
>
>> Zeppelin imports ZeppelinContext object in spark interpreter using which
>> you can plot dataframe,dataset and even rdd.To do so you just need to use
>> "z.show(df)" in the paragraph (here df is the Dataframe which you want to
>> plot)
>>
>>
>> Regards,
>> Sachin Janani
>>
>> On Mon, Sep 12, 2016 at 11:20 AM, andy petrella 
>> wrote:
>>
>>> Heya, probably worth giving the Spark Notebook
>>>  a go then.
>>> It can plot any scala data (collection, rdd, df, ds, custom, ...), all
>>> are reactive so they can deal with any sort of incoming data. You can ask
>>> on the gitter  if you
>>> like.
>>>
>>> hth
>>> cheers
>>>
>>> On Sun, Sep 11, 2016 at 11:12 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Zeppelin is getting better.

 In its description it says:

 [image: image.png]

 So far so good. One feature that I have not managed to work out is
 creating plots with Spark functional programming. I can get SQL going by
 connecting to Spark thrift server and you can plot the results

 [image: image.png]

 However, if I wrote that using functional programming I won't be able
 to plot it. the plot feature is not available.

 Is this correct or I am missing something?

 Thanks

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



>>> --
>>> andy
>>>
>>
>>
>


Re: Using Zeppelin with Spark FP

2016-09-12 Thread Mich Talebzadeh
Does Zeppelin work OK with Spark 2?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 September 2016 at 08:24, Sachin Janani  wrote:

> Zeppelin imports ZeppelinContext object in spark interpreter using which
> you can plot dataframe,dataset and even rdd.To do so you just need to use
> "z.show(df)" in the paragraph (here df is the Dataframe which you want to
> plot)
>
>
> Regards,
> Sachin Janani
>
> On Mon, Sep 12, 2016 at 11:20 AM, andy petrella 
> wrote:
>
>> Heya, probably worth giving the Spark Notebook
>>  a go then.
>> It can plot any scala data (collection, rdd, df, ds, custom, ...), all
>> are reactive so they can deal with any sort of incoming data. You can ask
>> on the gitter  if you
>> like.
>>
>> hth
>> cheers
>>
>> On Sun, Sep 11, 2016 at 11:12 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Zeppelin is getting better.
>>>
>>> In its description it says:
>>>
>>> [image: image.png]
>>>
>>> So far so good. One feature that I have not managed to work out is
>>> creating plots with Spark functional programming. I can get SQL going by
>>> connecting to Spark thrift server and you can plot the results
>>>
>>> [image: image.png]
>>>
>>> However, if I wrote that using functional programming I won't be able to
>>> plot it. the plot feature is not available.
>>>
>>> Is this correct or I am missing something?
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>> --
>> andy
>>
>
>


Re: Using Zeppelin with Spark FP

2016-09-12 Thread Sachin Janani
Zeppelin imports a ZeppelinContext object into the Spark interpreter, using
which you can plot a DataFrame, a Dataset, or even an RDD. To do so you just
need to use "z.show(df)" in the paragraph (here df is the DataFrame which you
want to plot).
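
For example, in a %spark paragraph (toy data, made-up column names):

// Zeppelin's spark interpreter pre-defines spark (SparkSession) and z (ZeppelinContext).
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")
z.show(df)   // renders the DataFrame with Zeppelin's table/chart widgets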


Regards,
Sachin Janani

On Mon, Sep 12, 2016 at 11:20 AM, andy petrella 
wrote:

> Heya, probably worth giving the Spark Notebook
>  a go then.
> It can plot any scala data (collection, rdd, df, ds, custom, ...), all are
> reactive so they can deal with any sort of incoming data. You can ask on
> the gitter  if you like.
>
> hth
> cheers
>
> On Sun, Sep 11, 2016 at 11:12 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Zeppelin is getting better.
>>
>> In its description it says:
>>
>> [image: image.png]
>>
>> So far so good. One feature that I have not managed to work out is
>> creating plots with Spark functional programming. I can get SQL going by
>> connecting to Spark thrift server and you can plot the results
>>
>> [image: image.png]
>>
>> However, if I wrote that using functional programming I won't be able to
>> plot it. the plot feature is not available.
>>
>> Is this correct or I am missing something?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
> --
> andy
>