Re: Unable to set cores while submitting Spark job

2016-03-31 Thread Mich Talebzadeh
Hi Shridhar

Can you check on the Spark GUI whether the number of cores shown per worker is
the same as you set up? This shows under the "Cores" column.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 1 April 2016 at 05:59, vetal king  wrote:

> Ted, Mich,
>
> Thanks for your replies. I ended up using sparkConf.set() and
> accepted cores as a parameter. But I'm still not sure why spark-submit's
> executor-cores or driver-cores property did not work. Setting cores within
> the main method seems to be a bit cumbersome.
>
> Thanks again,
> Shridhar
>
>
>
> On Wed, Mar 30, 2016 at 8:42 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Ted
>>
>> Can you specify the cores as follows, for example 12 cores?:
>>
>>   val conf = new SparkConf().
>>     setAppName("ImportStat").
>>     setMaster("local[12]").
>>     set("spark.driver.allowMultipleContexts", "true").
>>     set("spark.hadoop.validateOutputSpecs", "false")
>>   val sc = new SparkContext(conf)
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 30 March 2016 at 14:59, Ted Yu  wrote:
>>
>>> -c CORES, --cores CORES Total CPU cores to allow Spark applications to
>>> use on the machine (default: all available); only on worker
>>>
>>> bq. sc.getConf().set()
>>>
>>> I think you should use this pattern (shown in
>>> https://spark.apache.org/docs/latest/spark-standalone.html):
>>>
>>> val conf = new SparkConf()
>>>   .setMaster(...)
>>>   .setAppName(...)
>>>   .set("spark.cores.max", "1")
>>> val sc = new SparkContext(conf)
>>>
>>>
>>> On Wed, Mar 30, 2016 at 5:46 AM, vetal king 
>>> wrote:
>>>
 Hi all,

 While submitting Spark Job I am specifying the options --executor-cores
 1 and --driver-cores 1. However, when the job was submitted, the job used
 all available cores. So I tried to limit the cores within my main function with
 sc.getConf().set("spark.cores.max", "1"); however, it still used
 all available cores

 I am using Spark in standalone mode (spark://:7077)

 Any idea what I am missing?
 Thanks in Advance,

 Shridhar


>>>
>>
>


Re: Unable to set cores while submitting Spark job

2016-03-31 Thread vetal king
Ted, Mich,

Thanks for your replies. I ended up using sparkConf.set() and
accepted cores as a parameter. But I'm still not sure why spark-submit's
executor-cores or driver-cores property did not work. Setting cores within
the main method seems to be a bit cumbersome.
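
For reference, a rough sketch of what I ended up with (the app name is a
placeholder, and the cores value is really passed in as a parameter). Note that
spark.cores.max has to be set on the SparkConf before the SparkContext is
created; calling sc.getConf().set(...) after the context exists has no effect:

import org.apache.spark.{SparkConf, SparkContext}

// spark.cores.max caps the total cores the app may take on a standalone cluster.
val maxCores = args(0)                  // e.g. "1", accepted as a parameter
val conf = new SparkConf()
  .setAppName("MyApp")                  // placeholder
  .set("spark.cores.max", maxCores)
val sc = new SparkContext(conf)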

Thanks again,
Shridhar


On Wed, Mar 30, 2016 at 8:42 PM, Mich Talebzadeh 
wrote:

> Hi Ted
>
> Can you specify the cores as follows, for example 12 cores?:
>
>   val conf = new SparkConf().
>     setAppName("ImportStat").
>     setMaster("local[12]").
>     set("spark.driver.allowMultipleContexts", "true").
>     set("spark.hadoop.validateOutputSpecs", "false")
>   val sc = new SparkContext(conf)
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 March 2016 at 14:59, Ted Yu  wrote:
>
>> -c CORES, --cores CORES Total CPU cores to allow Spark applications to
>> use on the machine (default: all available); only on worker
>>
>> bq. sc.getConf().set()
>>
>> I think you should use this pattern (shown in
>> https://spark.apache.org/docs/latest/spark-standalone.html):
>>
>> val conf = new SparkConf()
>>   .setMaster(...)
>>   .setAppName(...)
>>   .set("spark.cores.max", "1")
>> val sc = new SparkContext(conf)
>>
>>
>> On Wed, Mar 30, 2016 at 5:46 AM, vetal king  wrote:
>>
>>> Hi all,
>>>
>>> While submitting Spark Job I am specifying the options --executor-cores
>>> 1 and --driver-cores 1. However, when the job was submitted, the job used
>>> all available cores. So I tried to limit the cores within my main function with
>>> sc.getConf().set("spark.cores.max", "1"); however, it still used
>>> all available cores
>>>
>>> I am using Spark in standalone mode (spark://:7077)
>>>
>>> Any idea what I am missing?
>>> Thanks in Advance,
>>>
>>> Shridhar
>>>
>>>
>>
>


How to release data frame to avoid memory leak

2016-03-31 Thread kramer2...@126.com
Hi

I have data frames created every 5 minutes. I use a dict to keep the most
recent 1 hour of data frames, so only 12 data frames can be kept in the dict.
When a new data frame comes in, the oldest data frame pops out.

My question is when I pop out the old data frame, do I have to call
dataframe.unpersist to release the memory?

For example 

if currentTime == fiveMinutes:
    myDict[currentTime] = dataframe           # keep the newest data frame
    oldestDataFrame = myDict.pop(oldest)      # evict the oldest one

Now do I have to call oldestDataFrame.unpersist()? Because I think Python
will automatically release unused variables.






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-release-data-frame-to-avoid-memory-leak-tp26656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SPARK-1.6 build with HIVE

2016-03-31 Thread guoqing0...@yahoo.com.hk.INVALID
Hi, I'd like to know: does Spark 1.6 only support Hive 0.13, or can it be built
with higher versions like 1.x?



guoqing0...@yahoo.com.hk


Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Abhishek Anand
This is what I am getting in the executor logs

16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
reverting partial writes to file
/data/spark-e2fc248f-a212-4a99-9d6c-4e52d6a69070/executor-37679a6c-cb96-451e-a284-64d6b4fe9910/blockmgr-f8ca72f4-f329-468b-8e65-ef97f8fb285c/38/temp_shuffle_8f266d70-3fc6-41e5-bbaa-c413a7b08ea4
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:315)
at
org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:274)



It happens every time the disk is full.

On Fri, Apr 1, 2016 at 2:18 AM, Ted Yu  wrote:

> Can you show the stack trace ?
>
> The log message came from
> DiskBlockObjectWriter#revertPartialWritesAndClose().
> Unfortunately, the method doesn't throw an exception, making it a bit hard
> for the caller to know of the disk-full condition.
>
> On Thu, Mar 31, 2016 at 11:32 AM, Abhishek Anand 
> wrote:
>
>>
>> Hi,
>>
>> Why is it that when the disk space is full on one of the workers, the
>> executor on that worker becomes unresponsive and the jobs on that
>> worker fail with the exception
>>
>>
>> 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
>> reverting partial writes to file
>> /data/spark-e2fc248f-a212-4a99-9d6c-4e52d6a69070/executor-37679a6c-cb96-451e-a284-64d6b4fe9910/blockmgr-f8ca72f4-f329-468b-8e65-ef97f8fb285c/38/temp_shuffle_8f266d70-3fc6-41e5-bbaa-c413a7b08ea4
>> java.io.IOException: No space left on device
>>
>>
>> This is leading to my job getting stuck.
>>
>> As a workaround I have to kill the executor and clear the space on disk; a
>> new executor is then relaunched by the worker and the failed stages are recomputed.
>>
>>
>> How can I get rid of this problem, i.e. why does my job get stuck on a
>> disk-full issue on one of the workers?
>>
>>
>> Cheers !!!
>> Abhi
>>
>>
>


spark-shell failing but pyspark works

2016-03-31 Thread Cyril Scetbon
Hi,

I'm having issues creating a StreamingContext with Scala using spark-shell. It
tries to access the localhost interface, and the Application Master is not
running on this interface:

ERROR ApplicationMaster: Failed to connect to driver at localhost:47257, 
retrying ...

I don't have the issue with Python and pyspark, which works fine (you can see
it uses the IP address):

ApplicationMaster: Driver now available: 192.168.10.100:43290

I use similar codes though :

test.scala :
--

import org.apache.spark._
import org.apache.spark.streaming._
val app = "test-scala"
val conf = new SparkConf().setAppName(app).setMaster("yarn-client")
val ssc = new StreamingContext(conf, Seconds(3))

command used : spark-shell -i test.scala

test.py :
---

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
app = "test-python"
conf = SparkConf().setAppName(app).setMaster("yarn-client")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 3)

command used : pyspark test.py

Any idea why Scala can't instantiate it? I thought Python was essentially using
Scala under the hood, but it seems there are differences. Are there any
parameters set when using Scala but not Python?

Thanks
-- 
Cyril SCETBON


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Execution error during ALS execution in spark

2016-03-31 Thread buring
I have some suggestions you may try (see the sketch below):
1) Persist the input RDD with the persist method; this may save a lot of running time.
2) From the UI you can see the cluster spends much time in the shuffle stage; this
can be adjusted through some conf parameters, such as
"spark.shuffle.memoryFraction" and "spark.memory.fraction".
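
A rough sketch of both suggestions (all names and values are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("ALSJob")                               // placeholder
  .set("spark.shuffle.memoryFraction", "0.4")         // example value; tune for your job
val sc = new SparkContext(conf)

// "ratings" stands for whatever RDD is fed to ALS; persisting it avoids
// recomputation across the many ALS iterations.
val ratings = sc.textFile("hdfs:///path/to/ratings")  // hypothetical input
ratings.persist(StorageLevel.MEMORY_AND_DISK)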

good luck



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Execution-error-during-ALS-execution-in-spark-tp26644p26652.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Apache Spark-Get All Field Names From Nested Arbitrary JSON Files

2016-03-31 Thread John Radin
Hello All-

I have run into a somewhat perplexing issue that has plagued me for several
months now (with haphazard workarounds). I am trying to create an Avro
Schema (schema-enforced format for serializing arbitrary data, basically,
as I understand it) to convert some complex JSON files (arbitrary and
nested) eventually to Parquet in a pipeline.

I am wondering if there is a way to get the superset of field names I need
for this use case while staying in Apache Spark instead of Hadoop MR, in a
reasonable fashion?
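
To be concrete, what I am after is roughly this (just a sketch: fieldNames is a
hypothetical helper of mine, and it assumes sqlContext.read.json can merge the
schemas it observes across files):

import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

// Collect fully qualified field names from a (possibly nested) schema.
def fieldNames(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.flatMap {
    case StructField(name, nested: StructType, _, _) =>
      (prefix + name) +: fieldNames(nested, prefix + name + ".")
    case StructField(name, ArrayType(nested: StructType, _), _, _) =>
      (prefix + name) +: fieldNames(nested, prefix + name + ".")
    case StructField(name, _, _, _) =>
      Seq(prefix + name)
  }

val df = sqlContext.read.json("hdfs:///path/to/json/*")   // made-up path
val allFields = fieldNames(df.schema)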

I think Apache Arrow, under development, might be able to help avoid this by
treating JSON as a first-class citizen eventually, but it is still a ways
off yet.

Any guidance would be sincerely appreciated!

Thanks!

John


Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Jacek Laskowski
Hi Ted,

Sure! It works with map, but not with select. Wonder if it's by design
or...will soon be fixed? Thanks again for your help.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Mar 31, 2016 at 10:57 AM, Ted Yu  wrote:
> I tried this:
>
> scala> final case class Text(id: Int, text: String)
> warning: there was one unchecked warning; re-run with -unchecked for details
> defined class Text
>
> scala> val ds = Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text]
> ds: org.apache.spark.sql.Dataset[Text] = [id: int, text: string]
>
> scala> ds.map(t => t.id).show
> +-----+
> |value|
> +-----+
> |    0|
> |    1|
> +-----+
>
> On Thu, Mar 31, 2016 at 5:02 AM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> I can't seem to use Dataset using case classes (or tuples) to select per
>> field:
>>
>> scala> final case class Text(id: Int, text: String)
>> warning: there was one unchecked warning; re-run with -unchecked for
>> details
>> defined class Text
>>
>> scala> val ds = Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text]
>> ds: org.apache.spark.sql.Dataset[Text] = [id: int, text: string]
>>
>> // query per field as symbol works fine
>> scala> ds.select('id).show
>> +---+
>> | id|
>> +---+
>> |  0|
>> |  1|
>> +---+
>>
>> // but not per field as Scala attribute
>> scala> ds.select(_.id).show
>> :40: error: missing parameter type for expanded function
>> ((x$1) => x$1.id)
>>ds.select(_.id).show
>>  ^
>>
>> Is this supposed to work in Spark 2.0 (today's build)?
>>
>> BTW, why is Seq(Text(0, "hello"), Text(1, "world")).as[Text] not possible?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [SQL] A bug with withColumn?

2016-03-31 Thread Jacek Laskowski
Hi,

Thanks Ted.

It means that it's not only possible to rename a column using
withColumnRenamed, but also to replace the content of a column (in one
shot) using withColumn with an existing column name. I can live with
that :)
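
In other words, with a throwaway two-row df (a sketch; lit comes from
org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.lit

val df = sqlContext.range(2).withColumn("text", lit("hello"))
df.withColumn("text", lit(5)).show()        // the existing "text" column is replaced
df.withColumnRenamed("text", "text_old")    // rename first to keep the old values
  .withColumn("text", lit(5)).show()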

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Mar 31, 2016 at 5:14 PM, Ted Yu  wrote:
> Looks like this is the result of the following check:
>
> val shouldReplace = output.exists(f => resolver(f.name, colName))
> if (shouldReplace) {
>
> where existing column, text, was replaced.
>
> On Thu, Mar 31, 2016 at 12:08 PM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Just ran into the following. Is this a bug?
>>
>> scala> df.join(nums, df("id") === nums("id")).withColumn("TEXT2",
>> lit(5)).show
>> +---+-------+---+-----+-----+
>> | id|   text| id| text|TEXT2|
>> +---+-------+---+-----+-----+
>> |  0|  hello|  0|  two|    5|
>> |  1|swiecie|  1|three|    5|
>> +---+-------+---+-----+-----+
>>
>>
>> scala> df.join(nums, df("id") === nums("id")).withColumn("TEXT",
>> lit(5)).show
>> +---+----+---+----+
>> | id|TEXT| id|TEXT|
>> +---+----+---+----+
>> |  0|   5|  0|   5|
>> |  1|   5|  1|   5|
>> +---+----+---+----+
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [SQL] A bug with withColumn?

2016-03-31 Thread Ted Yu
Looks like this is the result of the following check:

val shouldReplace = output.exists(f => resolver(f.name, colName))
if (shouldReplace) {

where existing column, text, was replaced.

On Thu, Mar 31, 2016 at 12:08 PM, Jacek Laskowski  wrote:

> Hi,
>
> Just ran into the following. Is this a bug?
>
> scala> df.join(nums, df("id") === nums("id")).withColumn("TEXT2",
> lit(5)).show
> +---+-------+---+-----+-----+
> | id|   text| id| text|TEXT2|
> +---+-------+---+-----+-----+
> |  0|  hello|  0|  two|    5|
> |  1|swiecie|  1|three|    5|
> +---+-------+---+-----+-----+
>
>
> scala> df.join(nums, df("id") === nums("id")).withColumn("TEXT",
> lit(5)).show
> +---+----+---+----+
> | id|TEXT| id|TEXT|
> +---+----+---+----+
> |  0|   5|  0|   5|
> |  1|   5|  1|   5|
> +---+----+---+----+
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark process creating and writing to a Hive ORC table

2016-03-31 Thread Ashok Kumar
Hello,
How feasible is it to use Spark to extract CSV files and create and write the
content to an ORC table in a Hive database?
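
For what it's worth, this is the kind of thing I have in mind (a rough sketch,
assuming Spark 1.6 with a HiveContext and the external spark-csv package; all
paths and names are made up):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Needs com.databricks:spark-csv_2.10 on the classpath.
val csv = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///staging/input/*.csv")

csv.write
  .format("orc")
  .saveAsTable("mydb.my_orc_table")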
Is Parquet the best (optimal) format to write to HDFS from a Spark app?
Thanks

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Ted Yu
Can you show the stack trace ?

The log message came from
DiskBlockObjectWriter#revertPartialWritesAndClose().
Unfortunately, the method doesn't throw an exception, making it a bit hard for
the caller to know of the disk-full condition.

On Thu, Mar 31, 2016 at 11:32 AM, Abhishek Anand 
wrote:

>
> Hi,
>
> Why is it that when the disk space is full on one of the workers, the
> executor on that worker becomes unresponsive and the jobs on that
> worker fail with the exception
>
>
> 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
> reverting partial writes to file
> /data/spark-e2fc248f-a212-4a99-9d6c-4e52d6a69070/executor-37679a6c-cb96-451e-a284-64d6b4fe9910/blockmgr-f8ca72f4-f329-468b-8e65-ef97f8fb285c/38/temp_shuffle_8f266d70-3fc6-41e5-bbaa-c413a7b08ea4
> java.io.IOException: No space left on device
>
>
> This is leading to my job getting stuck.
>
> As a workaround I have to kill the executor and clear the space on disk; a
> new executor is then relaunched by the worker and the failed stages are recomputed.
>
>
> How can I get rid of this problem, i.e. why does my job get stuck on a
> disk-full issue on one of the workers?
>
>
> Cheers !!!
> Abhi
>
>


OutOfMemory with wide (289 column) dataframe

2016-03-31 Thread ludflu
I'm building a spark job against Spark 1.6.0 / EMR 4.4 in Scala.

I'm attempting to concat a bunch of dataframe columns and then explode them into
new rows (just using the built-in concat and explode functions). It works great
in my unit test.

But I get out of memory issues when I run against my production data (25GB
of bzip2 compressed pipe delimited text split out into about 30 individual
files):

The exception I get is:

ExecutorLostFailure (executor 8 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5
GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.

EMR has set spark.executor.memory = 5120M

Since I don't really mind swapping to disk, I tried fiddling with the memory
settings:

  .set("spark.memory.fraction","0.15")
  .set("spark.memory.storageFraction","0.15")

Which does indeed help - but I still get a number of OutOfMemory problems
that kill executors.

I have some very wide DataFrames comprising 289 string columns once all
joined together.
I used the SizeEstimator to get a read on the size of a single row and it
clocked in at 2.4 MB, which seems like a lot!

Is this totally unreasonable? If this were a local process I'd know how to
profile it. But I don't know how to memory profile a spark cluster. What
should I do?

One thought that occurs to me is that I can extract just the columns I
need for the transformation and keep the rest of the columns in one big
string column. Then when the transformation is done, I can put them back
together. Is this a good idea, or is there a better way around the memory
issue?
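
Roughly what I have in mind (column names are made up; the real frame has 289
of them, and "wide" stands for the joined DataFrame):

import org.apache.spark.sql.functions.{col, concat_ws, explode, split}

// Keep a key plus only the columns the transformation actually needs...
val needed = wide.select(col("row_key"), col("c1"), col("c2"), col("c3"))
val exploded = needed.withColumn(
  "item", explode(split(concat_ws("|", col("c1"), col("c2"), col("c3")), "\\|")))

// ...then join the result back onto the untouched columns by the key.
val rest = wide.drop("c1").drop("c2").drop("c3")
val result = exploded.join(rest, "row_key")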

One further thought is that quite a lot of the columns are empty strings. But
it doesn't seem to make a difference to the size of the row when calculated
with SizeEstimator.

I thought that if I could just insert the data as nulls instead of empty
strings, I could save space. So I made the fields Option[String] instead and
used None for the empty strings. The resulting Row was an even larger 2.6
MB?

Any words of wisdom would be really appreciated!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-with-wide-289-column-dataframe-tp26651.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
> Please exclude jackson-databind - that was where the AnnotationMap class
> comes from.
>

I tried as you suggested but I'm getting the same error. It seems strange because
when I look at the generated jar there is nothing related to AnnotationMap, but
there is a databind there.




>
> On Thu, Mar 31, 2016 at 11:37 AM, Marcelo Oikawa <
> marcelo.oik...@webradar.com> wrote:
>
>> Hi, Alonso.
>>
>> As you can see, jackson-core is provided by several libraries; try to
>>> exclude it from spark-core. I think the minor version is included within
>>> it.
>>>
>>
>> There is no more than one jackson-core provided by spark-core. There are
>> jackson-core and jackson-core-asl, but they are different artifacts. BTW, I
>> tried to exclude them but no success. Same error:
>>
>> java.lang.IllegalAccessError: tried to access method
>> com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap;
>> from class
>> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
>> at
>> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
>> at
>> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
>> at
>> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
>> ...
>>
>> I guess the problem is an incompatibility between the jackson artifacts that
>> come from the tranquility dependency and the ones Spark provides. I also tried to
>> find the same jackson artifacts in different versions, but there are none.
>> What is missing?
>>
>>
>> Use this guide to see how to do it:
>>>
>>>
>>> https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html
>>>
>>>
>>>
>>> Alonso Isidoro Roman.
>>>
>>> Mis citas preferidas (de hoy) :
>>> "Si depurar es el proceso de quitar los errores de software, entonces
>>> programar debe ser el proceso de introducirlos..."
>>>  -  Edsger Dijkstra
>>>
>>> My favorite quotes (today):
>>> "If debugging is the process of removing software bugs, then programming
>>> must be the process of putting ..."
>>>   - Edsger Dijkstra
>>>
>>> "If you pay peanuts you get monkeys"
>>>
>>>
>>> 2016-03-31 20:01 GMT+02:00 Marcelo Oikawa :
>>>
 Hey, Alonso.

 here is the output:

 [INFO] spark-processor:spark-processor-druid:jar:1.0-SNAPSHOT
 [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.6.1:provided
 [INFO] |  +- org.apache.spark:spark-core_2.10:jar:1.6.1:provided
 [INFO] |  |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
 [INFO] |  |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
 [INFO] |  |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
 [INFO] |  |  +- com.twitter:chill_2.10:jar:0.5.0:provided
 [INFO] |  |  |  \- com.esotericsoftware.kryo:kryo:jar:2.21:provided
 [INFO] |  |  | +-
 com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:provided
 [INFO] |  |  | +-
 com.esotericsoftware.minlog:minlog:jar:1.2:provided
 [INFO] |  |  | \- org.objenesis:objenesis:jar:1.2:provided
 [INFO] |  |  +- com.twitter:chill-java:jar:0.5.0:provided
 [INFO] |  |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
 [INFO] |  |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:provided
 [INFO] |  |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:provided
 [INFO] |  |  |  |  +- org.apache.commons:commons-math:jar:2.1:provided
 [INFO] |  |  |  |  +- xmlenc:xmlenc:jar:0.52:provided
 [INFO] |  |  |  |  +-
 commons-configuration:commons-configuration:jar:1.6:provided
 [INFO] |  |  |  |  |  +-
 commons-digester:commons-digester:jar:1.8:provided
 [INFO] |  |  |  |  |  |  \-
 commons-beanutils:commons-beanutils:jar:1.7.0:provided
 [INFO] |  |  |  |  |  \-
 commons-beanutils:commons-beanutils-core:jar:1.8.0:provided
 [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-auth:jar:2.2.0:provided
 [INFO] |  |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:provided
 [INFO] |  |  |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:provided
 [INFO] |  |  |  +-
 org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:provided
 [INFO] |  |  |  |  +-
 org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:provided
 [INFO] |  |  |  |  |  +-
 org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:provided
 [INFO] |  |  |  |  |  |  +-
 com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:provided
 [INFO] |  |  |  |  |  |  |  +-
 com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:provided
 [INFO] |  |  |  |  |  |  |  |  \-
 com.sun.jersey:jersey-client:jar:1.9:provided
 [INFO] |  |  |  |  |  |  |  \-
 

[SQL] A bug with withColumn?

2016-03-31 Thread Jacek Laskowski
Hi,

Just ran into the following. Is this a bug?

scala> df.join(nums, df("id") === nums("id")).withColumn("TEXT2", lit(5)).show
+---+-------+---+-----+-----+
| id|   text| id| text|TEXT2|
+---+-------+---+-----+-----+
|  0|  hello|  0|  two|    5|
|  1|swiecie|  1|three|    5|
+---+-------+---+-----+-----+


scala> df.join(nums, df("id") === nums("id")).withColumn("TEXT", lit(5)).show
+---+----+---+----+
| id|TEXT| id|TEXT|
+---+----+---+----+
|  0|   5|  0|   5|
|  1|   5|  1|   5|
+---+----+---+----+

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Problem with jackson lib running on spark

2016-03-31 Thread Ted Yu
Please exclude jackson-databind - that was where the AnnotationMap class
comes from.

On Thu, Mar 31, 2016 at 11:37 AM, Marcelo Oikawa <
marcelo.oik...@webradar.com> wrote:

> Hi, Alonso.
>
> As you can see, jackson-core is provided by several libraries; try to
>> exclude it from spark-core. I think the minor version is included within
>> it.
>>
>
> There is no more than one jackson-core provided by spark-core. There are
> jackson-core and jackson-core-asl, but they are different artifacts. BTW, I
> tried to exclude them but no success. Same error:
>
> java.lang.IllegalAccessError: tried to access method
> com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap;
> from class
> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
> at
> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
> at
> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
> at
> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
> ...
>
> I guess the problem is an incompatibility between the jackson artifacts that come
> from the tranquility dependency and the ones Spark provides. I also tried to find the same
> jackson artifacts in different versions, but there are none. What is
> missing?
>
>
> Use this guide to see how to do it:
>>
>>
>> https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html
>>
>>
>>
>> Alonso Isidoro Roman.
>>
>> Mis citas preferidas (de hoy) :
>> "Si depurar es el proceso de quitar los errores de software, entonces
>> programar debe ser el proceso de introducirlos..."
>>  -  Edsger Dijkstra
>>
>> My favorite quotes (today):
>> "If debugging is the process of removing software bugs, then programming
>> must be the process of putting ..."
>>   - Edsger Dijkstra
>>
>> "If you pay peanuts you get monkeys"
>>
>>
>> 2016-03-31 20:01 GMT+02:00 Marcelo Oikawa :
>>
>>> Hey, Alonso.
>>>
>>> here is the output:
>>>
>>> [INFO] spark-processor:spark-processor-druid:jar:1.0-SNAPSHOT
>>> [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.6.1:provided
>>> [INFO] |  +- org.apache.spark:spark-core_2.10:jar:1.6.1:provided
>>> [INFO] |  |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
>>> [INFO] |  |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
>>> [INFO] |  |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
>>> [INFO] |  |  +- com.twitter:chill_2.10:jar:0.5.0:provided
>>> [INFO] |  |  |  \- com.esotericsoftware.kryo:kryo:jar:2.21:provided
>>> [INFO] |  |  | +-
>>> com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:provided
>>> [INFO] |  |  | +- com.esotericsoftware.minlog:minlog:jar:1.2:provided
>>> [INFO] |  |  | \- org.objenesis:objenesis:jar:1.2:provided
>>> [INFO] |  |  +- com.twitter:chill-java:jar:0.5.0:provided
>>> [INFO] |  |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
>>> [INFO] |  |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:provided
>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:provided
>>> [INFO] |  |  |  |  +- org.apache.commons:commons-math:jar:2.1:provided
>>> [INFO] |  |  |  |  +- xmlenc:xmlenc:jar:0.52:provided
>>> [INFO] |  |  |  |  +-
>>> commons-configuration:commons-configuration:jar:1.6:provided
>>> [INFO] |  |  |  |  |  +-
>>> commons-digester:commons-digester:jar:1.8:provided
>>> [INFO] |  |  |  |  |  |  \-
>>> commons-beanutils:commons-beanutils:jar:1.7.0:provided
>>> [INFO] |  |  |  |  |  \-
>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:provided
>>> [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-auth:jar:2.2.0:provided
>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:provided
>>> [INFO] |  |  |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:provided
>>> [INFO] |  |  |  +-
>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:provided
>>> [INFO] |  |  |  |  +-
>>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:provided
>>> [INFO] |  |  |  |  |  +-
>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:provided
>>> [INFO] |  |  |  |  |  |  +-
>>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:provided
>>> [INFO] |  |  |  |  |  |  |  +-
>>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:provided
>>> [INFO] |  |  |  |  |  |  |  |  \-
>>> com.sun.jersey:jersey-client:jar:1.9:provided
>>> [INFO] |  |  |  |  |  |  |  \-
>>> com.sun.jersey:jersey-grizzly2:jar:1.9:provided
>>> [INFO] |  |  |  |  |  |  | +-
>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:provided
>>> [INFO] |  |  |  |  |  |  | |  \-
>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:provided
>>> [INFO] |  |  |  |  |  |  | | \-
>>> 

Re: Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
Hi, Alonso.

As you can see, jackson-core is provided by several libraries; try to
> exclude it from spark-core. I think the minor version is included within
> it.
>

There is no more than one jackson-core provided by spark-core. There are
jackson-core and jackson-core-asl, but they are different artifacts. BTW, I
tried to exclude them but no success. Same error:

java.lang.IllegalAccessError: tried to access method
com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap;
from class
com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
at
com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
at
com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
at
com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
...

I guess the problem is an incompatibility between the jackson artifacts that come
from the tranquility dependency and the ones Spark provides. I also tried to find the same
jackson artifacts in different versions, but there are none. What is
missing?


Use this guide to see how to do it:
>
>
> https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html
>
>
>
> Alonso Isidoro Roman.
>
> Mis citas preferidas (de hoy) :
> "Si depurar es el proceso de quitar los errores de software, entonces
> programar debe ser el proceso de introducirlos..."
>  -  Edsger Dijkstra
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then programming
> must be the process of putting ..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>
>
> 2016-03-31 20:01 GMT+02:00 Marcelo Oikawa :
>
>> Hey, Alonso.
>>
>> here is the output:
>>
>> [INFO] spark-processor:spark-processor-druid:jar:1.0-SNAPSHOT
>> [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.6.1:provided
>> [INFO] |  +- org.apache.spark:spark-core_2.10:jar:1.6.1:provided
>> [INFO] |  |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
>> [INFO] |  |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
>> [INFO] |  |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
>> [INFO] |  |  +- com.twitter:chill_2.10:jar:0.5.0:provided
>> [INFO] |  |  |  \- com.esotericsoftware.kryo:kryo:jar:2.21:provided
>> [INFO] |  |  | +-
>> com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:provided
>> [INFO] |  |  | +- com.esotericsoftware.minlog:minlog:jar:1.2:provided
>> [INFO] |  |  | \- org.objenesis:objenesis:jar:1.2:provided
>> [INFO] |  |  +- com.twitter:chill-java:jar:0.5.0:provided
>> [INFO] |  |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
>> [INFO] |  |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:provided
>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:provided
>> [INFO] |  |  |  |  +- org.apache.commons:commons-math:jar:2.1:provided
>> [INFO] |  |  |  |  +- xmlenc:xmlenc:jar:0.52:provided
>> [INFO] |  |  |  |  +-
>> commons-configuration:commons-configuration:jar:1.6:provided
>> [INFO] |  |  |  |  |  +-
>> commons-digester:commons-digester:jar:1.8:provided
>> [INFO] |  |  |  |  |  |  \-
>> commons-beanutils:commons-beanutils:jar:1.7.0:provided
>> [INFO] |  |  |  |  |  \-
>> commons-beanutils:commons-beanutils-core:jar:1.8.0:provided
>> [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-auth:jar:2.2.0:provided
>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:provided
>> [INFO] |  |  |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:provided
>> [INFO] |  |  |  +-
>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:provided
>> [INFO] |  |  |  |  +-
>> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:provided
>> [INFO] |  |  |  |  |  +-
>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:provided
>> [INFO] |  |  |  |  |  |  +-
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:provided
>> [INFO] |  |  |  |  |  |  |  +-
>> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:provided
>> [INFO] |  |  |  |  |  |  |  |  \-
>> com.sun.jersey:jersey-client:jar:1.9:provided
>> [INFO] |  |  |  |  |  |  |  \-
>> com.sun.jersey:jersey-grizzly2:jar:1.9:provided
>> [INFO] |  |  |  |  |  |  | +-
>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:provided
>> [INFO] |  |  |  |  |  |  | |  \-
>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:provided
>> [INFO] |  |  |  |  |  |  | | \-
>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:provided
>> [INFO] |  |  |  |  |  |  | |\-
>> org.glassfish.external:management-api:jar:3.0.0-b012:provided
>> [INFO] |  |  |  |  |  |  | +-
>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:provided
>> [INFO] |  |  |  |  |  |  | |  \-
>> 

Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Abhishek Anand
Hi,

Why is it that when the disk space is full on one of the workers, the
executor on that worker becomes unresponsive and the jobs on that worker
fail with the exception


16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
reverting partial writes to file
/data/spark-e2fc248f-a212-4a99-9d6c-4e52d6a69070/executor-37679a6c-cb96-451e-a284-64d6b4fe9910/blockmgr-f8ca72f4-f329-468b-8e65-ef97f8fb285c/38/temp_shuffle_8f266d70-3fc6-41e5-bbaa-c413a7b08ea4
java.io.IOException: No space left on device


This is leading to my job getting stuck.

As a workaround I have to kill the executor and clear the space on disk; a new
executor is then relaunched by the worker and the failed stages are recomputed.


How can I get rid of this problem, i.e. why does my job get stuck on a
disk-full issue on one of the workers?


Cheers !!!
Abhi


Re: Calling spark from a java web application.

2016-03-31 Thread Ricardo Paiva
$SPARK_HOME/conf/log4j.properties

It uses $SPARK_HOME/conf/log4j.properties.template by default.

On Thu, Mar 31, 2016 at 3:28 PM, arul_anand_2000 [via Apache Spark User
List]  wrote:

> Can you please let me know how the log4j properties were configured. I am
> trying to integrate Spark with a web application. When the Spark context gets
> initialized, it overrides the existing log4j properties.
>
> Regards,
> arul.
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Calling-spark-from-a-java-web-application-tp20007p26649.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here.
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
globo.com




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Calling-spark-from-a-java-web-application-tp20007p26650.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark master keeps running out of RAM

2016-03-31 Thread Josh Rosen
One possible cause of a standalone master OOMing is
https://issues.apache.org/jira/browse/SPARK-6270.  In 2.x, this will be
fixed by https://issues.apache.org/jira/browse/SPARK-12299. In 1.x, one
mitigation is to disable event logging. Another workaround would be to
produce a patch which disables eager UI reconstruction for 1.x.
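
If you go the disable-event-logging route, it is a per-application setting; a
minimal sketch (the app name is a placeholder, and the same key can go in
spark-defaults.conf instead):

import org.apache.spark.{SparkConf, SparkContext}

// The standalone master rebuilds finished applications' UIs from the event
// logs, which is what can exhaust its heap.
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.eventLog.enabled", "false")
val sc = new SparkContext(conf)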

On Thu, Mar 31, 2016 at 11:16 AM Dillian Murphey 
wrote:

> Why would the Spark master run out of RAM if I have too many slaves? Is
> this a flaw in the coding? I'm just a user of Spark. The developer that
> set this up left the company, so I'm starting from the top here.
>
> So I noticed that if I spawn lots of jobs, my Spark master ends up crashing due
> to low memory. It makes sense to me that the master would just be a
> brain/controller to dish out jobs and that the resources on the slaves are what
> would get used up, not the master.
>
> Thanks for any ideas/concepts/info.
>
> Appreciate it much
>
>


Spark master keeps running out of RAM

2016-03-31 Thread Dillian Murphey
Why would the Spark master run out of RAM if I have too many slaves? Is
this a flaw in the coding? I'm just a user of Spark. The developer that
set this up left the company, so I'm starting from the top here.

So I noticed that if I spawn lots of jobs, my Spark master ends up crashing due
to low memory. It makes sense to me that the master would just be a
brain/controller to dish out jobs and that the resources on the slaves are what
would get used up, not the master.

Thanks for any ideas/concepts/info.

Appreciate it much


Re: Problem with jackson lib running on spark

2016-03-31 Thread Alonso Isidoro Roman
As you can see, jackson-core is provided by several libraries; try to
exclude it from spark-core. I think the minor version is included within
it.

Use this guide to see how to do it:

https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html



Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-03-31 20:01 GMT+02:00 Marcelo Oikawa :

> Hey, Alonso.
>
> here is the output:
>
> [INFO] spark-processor:spark-processor-druid:jar:1.0-SNAPSHOT
> [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.6.1:provided
> [INFO] |  +- org.apache.spark:spark-core_2.10:jar:1.6.1:provided
> [INFO] |  |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  |  +- com.twitter:chill_2.10:jar:0.5.0:provided
> [INFO] |  |  |  \- com.esotericsoftware.kryo:kryo:jar:2.21:provided
> [INFO] |  |  | +-
> com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:provided
> [INFO] |  |  | +- com.esotericsoftware.minlog:minlog:jar:1.2:provided
> [INFO] |  |  | \- org.objenesis:objenesis:jar:1.2:provided
> [INFO] |  |  +- com.twitter:chill-java:jar:0.5.0:provided
> [INFO] |  |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:provided
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:provided
> [INFO] |  |  |  |  +- org.apache.commons:commons-math:jar:2.1:provided
> [INFO] |  |  |  |  +- xmlenc:xmlenc:jar:0.52:provided
> [INFO] |  |  |  |  +-
> commons-configuration:commons-configuration:jar:1.6:provided
> [INFO] |  |  |  |  |  +- commons-digester:commons-digester:jar:1.8:provided
> [INFO] |  |  |  |  |  |  \-
> commons-beanutils:commons-beanutils:jar:1.7.0:provided
> [INFO] |  |  |  |  |  \-
> commons-beanutils:commons-beanutils-core:jar:1.8.0:provided
> [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-auth:jar:2.2.0:provided
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:provided
> [INFO] |  |  |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:provided
> [INFO] |  |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:provided
> [INFO] |  |  |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:provided
> [INFO] |  |  |  |  |  +-
> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:provided
> [INFO] |  |  |  |  |  |  +-
> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:provided
> [INFO] |  |  |  |  |  |  |  +-
> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:provided
> [INFO] |  |  |  |  |  |  |  |  \-
> com.sun.jersey:jersey-client:jar:1.9:provided
> [INFO] |  |  |  |  |  |  |  \-
> com.sun.jersey:jersey-grizzly2:jar:1.9:provided
> [INFO] |  |  |  |  |  |  | +-
> org.glassfish.grizzly:grizzly-http:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | |  \-
> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | | \-
> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:provided
> [INFO] |  |  |  |  |  |  | |\-
> org.glassfish.external:management-api:jar:3.0.0-b012:provided
> [INFO] |  |  |  |  |  |  | +-
> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | |  \-
> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | +-
> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:provided
> [INFO] |  |  |  |  |  |  | \-
> org.glassfish:javax.servlet:jar:3.1:provided
> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-json:jar:1.9:provided
> [INFO] |  |  |  |  |  | +-
> org.codehaus.jettison:jettison:jar:1.1:provided
> [INFO] |  |  |  |  |  | |  \- stax:stax-api:jar:1.0.1:provided
> [INFO] |  |  |  |  |  | +-
> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:provided
> [INFO] |  |  |  |  |  | |  \-
> javax.xml.bind:jaxb-api:jar:2.2.2:provided
> [INFO] |  |  |  |  |  | | \-
> javax.activation:activation:jar:1.1:provided
> [INFO] |  |  |  |  |  | +-
> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:provided
> [INFO] |  |  |  |  |  | \-
> org.codehaus.jackson:jackson-xc:jar:1.8.3:provided
> [INFO] |  |  |  |  |  \-
> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:provided
> [INFO] |  |  |  |  \-
> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:provided
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:provided
> [INFO] |  |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:provided
> [INFO] |  |  |  |  

Re: Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
Hey, Alonso.

here is the output:

[INFO] spark-processor:spark-processor-druid:jar:1.0-SNAPSHOT
[INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.6.1:provided
[INFO] |  +- org.apache.spark:spark-core_2.10:jar:1.6.1:provided
[INFO] |  |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
[INFO] |  |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
[INFO] |  |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
[INFO] |  |  +- com.twitter:chill_2.10:jar:0.5.0:provided
[INFO] |  |  |  \- com.esotericsoftware.kryo:kryo:jar:2.21:provided
[INFO] |  |  | +-
com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:provided
[INFO] |  |  | +- com.esotericsoftware.minlog:minlog:jar:1.2:provided
[INFO] |  |  | \- org.objenesis:objenesis:jar:1.2:provided
[INFO] |  |  +- com.twitter:chill-java:jar:0.5.0:provided
[INFO] |  |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
[INFO] |  |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:provided
[INFO] |  |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:provided
[INFO] |  |  |  |  +- org.apache.commons:commons-math:jar:2.1:provided
[INFO] |  |  |  |  +- xmlenc:xmlenc:jar:0.52:provided
[INFO] |  |  |  |  +-
commons-configuration:commons-configuration:jar:1.6:provided
[INFO] |  |  |  |  |  +- commons-digester:commons-digester:jar:1.8:provided
[INFO] |  |  |  |  |  |  \-
commons-beanutils:commons-beanutils:jar:1.7.0:provided
[INFO] |  |  |  |  |  \-
commons-beanutils:commons-beanutils-core:jar:1.8.0:provided
[INFO] |  |  |  |  \- org.apache.hadoop:hadoop-auth:jar:2.2.0:provided
[INFO] |  |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:provided
[INFO] |  |  |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:provided
[INFO] |  |  |  +-
org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:provided
[INFO] |  |  |  |  +-
org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:provided
[INFO] |  |  |  |  |  +-
org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:provided
[INFO] |  |  |  |  |  |  +-
com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:provided
[INFO] |  |  |  |  |  |  |  +-
com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:provided
[INFO] |  |  |  |  |  |  |  |  \-
com.sun.jersey:jersey-client:jar:1.9:provided
[INFO] |  |  |  |  |  |  |  \-
com.sun.jersey:jersey-grizzly2:jar:1.9:provided
[INFO] |  |  |  |  |  |  | +-
org.glassfish.grizzly:grizzly-http:jar:2.1.2:provided
[INFO] |  |  |  |  |  |  | |  \-
org.glassfish.grizzly:grizzly-framework:jar:2.1.2:provided
[INFO] |  |  |  |  |  |  | | \-
org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:provided
[INFO] |  |  |  |  |  |  | |\-
org.glassfish.external:management-api:jar:3.0.0-b012:provided
[INFO] |  |  |  |  |  |  | +-
org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:provided
[INFO] |  |  |  |  |  |  | |  \-
org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:provided
[INFO] |  |  |  |  |  |  | +-
org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:provided
[INFO] |  |  |  |  |  |  | \-
org.glassfish:javax.servlet:jar:3.1:provided
[INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-json:jar:1.9:provided
[INFO] |  |  |  |  |  | +-
org.codehaus.jettison:jettison:jar:1.1:provided
[INFO] |  |  |  |  |  | |  \- stax:stax-api:jar:1.0.1:provided
[INFO] |  |  |  |  |  | +-
com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:provided
[INFO] |  |  |  |  |  | |  \- javax.xml.bind:jaxb-api:jar:2.2.2:provided
[INFO] |  |  |  |  |  | | \-
javax.activation:activation:jar:1.1:provided
[INFO] |  |  |  |  |  | +-
org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:provided
[INFO] |  |  |  |  |  | \-
org.codehaus.jackson:jackson-xc:jar:1.8.3:provided
[INFO] |  |  |  |  |  \-
org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:provided
[INFO] |  |  |  |  \-
org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:provided
[INFO] |  |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:provided
[INFO] |  |  |  +-
org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:provided
[INFO] |  |  |  |  \-
org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:provided
[INFO] |  |  |  +-
org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:provided
[INFO] |  |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:provided
[INFO] |  |  +- org.apache.spark:spark-launcher_2.10:jar:1.6.1:provided
[INFO] |  |  +-
org.apache.spark:spark-network-common_2.10:jar:1.6.1:provided
[INFO] |  |  +-
org.apache.spark:spark-network-shuffle_2.10:jar:1.6.1:provided
[INFO] |  |  |  \- org.fusesource.leveldbjni:leveldbjni-all:jar:1.8:provided
[INFO] |  |  +- org.apache.spark:spark-unsafe_2.10:jar:1.6.1:provided
[INFO] |  |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:provided
[INFO] |  |  |  \- commons-httpclient:commons-httpclient:jar:3.1:provided
[INFO] |  |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO] |  |  +-
org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:provided
[INFO] |  |  +- 

Re: Problem with jackson lib running on spark

2016-03-31 Thread Alonso Isidoro Roman
Run mvn dependency:tree and paste the output here; I suspect that the jackson
library is included within more than one dependency.

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-03-31 19:21 GMT+02:00 Marcelo Oikawa :

> Hey, Ted
>
> 2.4.4
>>
>> Looks like Tranquility uses different version of jackson.
>>
>> How do you build your jar ?
>>
>
> I'm building a jar with dependencies using the maven assembly plugin.
> Below are all the jackson dependencies:
>
> [INFO]
> com.fasterxml.jackson.module:jackson-module-scala_2.10:jar:2.4.5:compile
> [INFO]com.fasterxml.jackson.core:jackson-databind:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.datatype:jackson-datatype-joda:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.dataformat:jackson-dataformat-smile:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-smile-provider:jar:2.4.6:compile
> [INFO]com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.module:jackson-module-jaxb-annotations:jar:2.4.6:compile
> [INFO]com.fasterxml.jackson.core:jackson-core:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:jar:2.4.6:compile
> [INFO]
> com.fasterxml.jackson.datatype:jackson-datatype-guava:jar:2.4.6:compile
> [INFO]com.fasterxml.jackson.core:jackson-annotations:jar:2.4.6:compile
> [INFO]org.json4s:json4s-jackson_2.10:jar:3.2.10:provided
> [INFO]org.codehaus.jackson:jackson-xc:jar:1.8.3:provided
> [INFO]org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:provided
> [INFO]org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
> [INFO]org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
> [INFO]
> com.google.http-client:google-http-client-jackson2:jar:1.15.0-rc:compile
>
> As you can see, my jar requires fasterxml 2.4.6. In that case, what does
> Spark do? Does it run my jar with my jackson lib (inside my jar) or use
> the jackson version (2.4.4) used by Spark?
>
> Note that one of my dependencies is:
>
> 
> org.apache.spark
> spark-streaming_2.10
> 1.6.1
> provided
> 
>
> and the jackson version 2.4.4 was not listed in maven dependencies...
>
>
>>
>> Consider using maven-shade-plugin to resolve the conflict if you use
>> maven.
>>
>> Cheers
>>
>> On Thu, Mar 31, 2016 at 9:50 AM, Marcelo Oikawa <
>> marcelo.oik...@webradar.com> wrote:
>>
>>> Hi, list.
>>>
>>> We are working on a spark application that sends messages to Druid. For
>>> that, we're using Tranquility core. In my local test, I'm using the
>>> "spark-1.6.1-bin-hadoop2.6" distribution and the following dependencies in
>>> my app:
>>>
>>> 
>>> org.apache.spark
>>> spark-streaming_2.10
>>> 1.6.1
>>> provided
>>> 
>>> 
>>> io.druid
>>> tranquility-core_2.10
>>> 0.7.4
>>> 
>>>
>>> But I'm getting the error below when Tranquility tries to create the
>>> Tranquilizer object:
>>>
>>> tranquilizer = 
>>> DruidBeams.fromConfig(dataSourceConfig).buildTranquilizer(tranquilizerBuider);
>>>
>>> The stacktrace is down below:
>>>
>>> java.lang.IllegalAccessError: tried to access method
>>> com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap;
>>> from class
>>> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
>>> at
>>> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
>>> at
>>> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
>>> at
>>> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
>>> at
>>> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._constructDefaultValueInstantiator(BasicDeserializerFactory.java:325)
>>> at
>>> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findValueInstantiator(BasicDeserializerFactory.java:266)
>>> at
>>> com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:266)
>>> at
>>> com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:168)
>>> at
>>> com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:399)
>>> at
>>> com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:348)
>>> at
>>> 

Re: Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
Hey, Ted

2.4.4
>
> Looks like Tranquility uses different version of jackson.
>
> How do you build your jar ?
>

I'm building a jar with dependencies using the maven assembly plugin. Below
are all the jackson dependencies:

[INFO]
com.fasterxml.jackson.module:jackson-module-scala_2.10:jar:2.4.5:compile
[INFO]com.fasterxml.jackson.core:jackson-databind:jar:2.4.6:compile
[INFO]
com.fasterxml.jackson.datatype:jackson-datatype-joda:jar:2.4.6:compile
[INFO]
com.fasterxml.jackson.dataformat:jackson-dataformat-smile:jar:2.4.6:compile
[INFO]
com.fasterxml.jackson.jaxrs:jackson-jaxrs-smile-provider:jar:2.4.6:compile
[INFO]com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:jar:2.4.6:compile
[INFO]
com.fasterxml.jackson.module:jackson-module-jaxb-annotations:jar:2.4.6:compile
[INFO]com.fasterxml.jackson.core:jackson-core:jar:2.4.6:compile
[INFO]
com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:jar:2.4.6:compile
[INFO]
com.fasterxml.jackson.datatype:jackson-datatype-guava:jar:2.4.6:compile
[INFO]com.fasterxml.jackson.core:jackson-annotations:jar:2.4.6:compile
[INFO]org.json4s:json4s-jackson_2.10:jar:3.2.10:provided
[INFO]org.codehaus.jackson:jackson-xc:jar:1.8.3:provided
[INFO]org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:provided
[INFO]org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
[INFO]org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
[INFO]
com.google.http-client:google-http-client-jackson2:jar:1.15.0-rc:compile

As you can see, my jar requires fasterxml 2.4.6. In that case, what does
Spark do? Does it run my jar with my jackson lib (inside my jar) or use
the jackson version (2.4.4) used by Spark?

Note that one of my dependencies is:


org.apache.spark
spark-streaming_2.10
1.6.1
provided


and the jackson version 2.4.4 was not listed in maven dependencies...


>
> Consider using maven-shade-plugin to resolve the conflict if you use maven.
>
> Cheers
>
> On Thu, Mar 31, 2016 at 9:50 AM, Marcelo Oikawa <
> marcelo.oik...@webradar.com> wrote:
>
>> Hi, list.
>>
>> We are working on a spark application that sends messages to Druid. For
>> that, we're using Tranquility core. In my local test, I'm using the
>> "spark-1.6.1-bin-hadoop2.6" distribution and the following dependencies in
>> my app:
>>
>> 
>> org.apache.spark
>> spark-streaming_2.10
>> 1.6.1
>> provided
>> 
>> 
>> io.druid
>> tranquility-core_2.10
>> 0.7.4
>> 
>>
>> But I'm getting the error below when Tranquility tries to create the
>> Tranquilizer object:
>>
>> tranquilizer = 
>> DruidBeams.fromConfig(dataSourceConfig).buildTranquilizer(tranquilizerBuider);
>>
>> The stacktrace is down below:
>>
>> java.lang.IllegalAccessError: tried to access method
>> com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap;
>> from class
>> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
>> at
>> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
>> at
>> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
>> at
>> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
>> at
>> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._constructDefaultValueInstantiator(BasicDeserializerFactory.java:325)
>> at
>> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findValueInstantiator(BasicDeserializerFactory.java:266)
>> at
>> com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:266)
>> at
>> com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:168)
>> at
>> com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:399)
>> at
>> com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:348)
>> at
>> com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:261)
>> at
>> com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:241)
>> at
>> com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142)
>> at
>> com.fasterxml.jackson.databind.DeserializationContext.findContextualValueDeserializer(DeserializationContext.java:380)
>> at
>> com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.construct(PropertyBasedCreator.java:96)
>> at
>> com.fasterxml.jackson.databind.deser.BeanDeserializerBase.resolve(BeanDeserializerBase.java:413)
>> at
>> com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:292)

Re: transformation - spark vs cassandra

2016-03-31 Thread Femi Anthony
Try it out on a smaller subset of data and see which gives the better
performance.

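One quick way to compare on a subset (a sketch against the
spark-cassandra-connector data source used in this thread; keyspace, table and
columns follow the example quoted below): check whether the predicates are
actually pushed down to Cassandra before timing both variants.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "test"))
  .load()

// If cdate/country map onto partition/clustering key columns, the connector can
// push these filters into the CQL query; explain(true) lists the pushed filters.
val filtered = df.filter("cdate = '2016-06-07'").filter("country = 'USA'")
filtered.explain(true)
println(filtered.count())
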
On Thu, Mar 31, 2016 at 12:11 PM, Arun Sethia  wrote:

> Thanks Imre.
>
> But I thought spark-cassandra driver is going to do same internally.
>
> On Thu, Mar 31, 2016 at 10:32 AM, Imre Nagi 
> wrote:
>
>> I think querying by cassandra query language will be better in terms of
>> performance if you want to pull and filter the data from your db, rather
>> than pulling all of the data and do some filtering and transformation by
>> using spark data frame.
>>
>>
>> On 31 Mar 2016 22:19, "asethia"  wrote:
>>
>>> Hi,
>>>
>>> I am working with Cassandra and Spark, would like to know what is best
>>> performance using Cassandra filter based on primary key and cluster key
>>> vs
>>> using spark data frame transformation/filters.
>>>
>>> for example in spark:
>>>
>>>  val rdd = sqlContext.read.format("org.apache.spark.sql.cassandra")
>>>   .options(Map("keyspace" -> "test", "table" -> "test"))
>>>   .load()
>>>
>>> and then rdd.filter("cdate
>>> ='2016-06-07'").filter("country='USA'").count()
>>>
>>> vs
>>>
>>> using Cassandra (where cdate is part of primary key and country as
>>> cluster
>>> key).
>>>
>>> SELECT count(*) FROM test WHERE cdate ='2016-06-07' AND country='USA'
>>>
>>> I would like to know when should we use Cassandra simple query vs
>>> dataframe
>>> in terms of performance with billion of rows.
>>>
>>> Thanks
>>> arun
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/transformation-spark-vs-cassandra-tp26647.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: confusing about Spark SQL json format

2016-03-31 Thread Femi Anthony
I encountered a similar problem reading multi-line JSON files into Spark a
while back, and here's an article I wrote about how to solve it:

http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/

You may find it useful.

Femi

On Thu, Mar 31, 2016 at 12:32 PM,  wrote:

> You are correct that it does not take the standard JSON file format. From
> the Spark Docs:
> "Note that the file that is offered as *a json file* is not a typical
> JSON file. Each line must contain a separate, self-contained valid JSON
> object. As a consequence, a regular multi-line JSON file will most often
> fail.”
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> On Mar 31, 2016, at 5:30 AM, charles li  wrote:
>
> hi, UMESH, have you tried to load that json file on your machine? I did
> try it before, and here is the screenshot:
>
> <屏幕快照 2016-03-31 下午5.27.30.png>
> <屏幕快照 2016-03-31 下午5.27.39.png>
> ​
> ​
>
>
>
>
> On Thu, Mar 31, 2016 at 5:19 PM, UMESH CHAUDHARY 
> wrote:
>
>> Hi Charles,
>> The definition of object from www.json.org
>> 
>> :
>>
>> An *object* is an unordered set of name/value pairs. An object begins
>> with { (left brace) and ends with } (right brace). Each name is followed
>> by : (colon) and the name/value pairs are separated by , (comma).
>>
>> Its a pretty much OOPS paradigm , isn't it?
>>
>> Regards,
>> Umesh
>>
>> On Thu, Mar 31, 2016 at 2:34 PM, charles li 
>> wrote:
>>
>>> hi, UMESH, I think you've misunderstood the json definition.
>>>
>>> there is only one object in a json file:
>>>
>>>
>>> for the file, people.json, as bellow:
>>>
>>>
>>> 
>>>
>>> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>>
>>>
>>> ---
>>>
>>> it does have two valid format:
>>>
>>> 1.
>>>
>>>
>>> 
>>>
>>> [ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>> ]
>>>
>>>
>>> ---
>>>
>>> 2.
>>>
>>>
>>> 
>>>
>>> {"name": ["Yin", "Michael"],
>>> "address":[ {"city":"Columbus","state":"Ohio"},
>>> {"city":null, "state":"California"} ]
>>> }
>>>
>>> ---
>>>
>>>
>>>
>>> On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY 
>>> wrote:
>>>
 Hi,
 Look at below image which is from json.org
 
 :

 

 The above image describes the object formulation of below JSON:

 Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
 Object=> {"name":"Michael", "address":{"city":null,
 "state":"California"}}


 Note that "address" is also an object.



 On Thu, Mar 31, 2016 at 1:53 PM, charles li 
 wrote:

> as this post  says, that in spark, we can load a json file in this way
> bellow:
>
> *post* :
> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
> 
>
>
>
> ---
> sqlContext.jsonFile(file_path)
> or
> sqlContext.read.json(file_path)
>
> ---
>
>
> and the *json file format* looks like bellow, say *people.json*
>
>
> {"name":"Yin",
> 

Re: Problem with jackson lib running on spark

2016-03-31 Thread Ted Yu
Spark 1.6.1 uses this version of jackson:

2.4.4

Looks like Tranquility uses different version of jackson.

How do you build your jar ?

Consider using maven-shade-plugin to resolve the conflict if you use maven.

Cheers

On Thu, Mar 31, 2016 at 9:50 AM, Marcelo Oikawa  wrote:

> Hi, list.
>
> We are working on a spark application that sends messages to Druid. For
> that, we're using Tranquility core. In my local test, I'm using the
> "spark-1.6.1-bin-hadoop2.6" distribution and the following dependencies in
> my app:
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming_2.10</artifactId>
>     <version>1.6.1</version>
>     <scope>provided</scope>
> </dependency>
> <dependency>
>     <groupId>io.druid</groupId>
>     <artifactId>tranquility-core_2.10</artifactId>
>     <version>0.7.4</version>
> </dependency>
>
> But i getting the error down below when Tranquility tries to create
> Tranquilizer object:
>
> tranquilizer = 
> DruidBeams.fromConfig(dataSourceConfig).buildTranquilizer(tranquilizerBuider);
>
> The stacktrace is down below:
>
> java.lang.IllegalAccessError: tried to access method
> com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap;
> from class
> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
> at
> com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
> at
> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
> at
> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
> at
> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._constructDefaultValueInstantiator(BasicDeserializerFactory.java:325)
> at
> com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findValueInstantiator(BasicDeserializerFactory.java:266)
> at
> com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:266)
> at
> com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:168)
> at
> com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:399)
> at
> com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:348)
> at
> com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:261)
> at
> com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:241)
> at
> com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142)
> at
> com.fasterxml.jackson.databind.DeserializationContext.findContextualValueDeserializer(DeserializationContext.java:380)
> at
> com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.construct(PropertyBasedCreator.java:96)
> at
> com.fasterxml.jackson.databind.deser.BeanDeserializerBase.resolve(BeanDeserializerBase.java:413)
> at
> com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:292)
> at
> com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:241)
> at
> com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142)
> at
> com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:394)
> at
> com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3169)
> at
> com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:2767)
> at
> com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:2700)
> at
> com.metamx.tranquility.druid.DruidBeams$.fromConfigInternal(DruidBeams.scala:192)
> at
> com.metamx.tranquility.druid.DruidBeams$.fromConfig(DruidBeams.scala:119)
> at com.metamx.tranquility.druid.DruidBeams.fromConfig(DruidBeams.scala)
>
> Does someone faced that problem too?
>
> I know that it's related to jackson lib conflict but could anyone please
> shed some light? I created a jar with dependencies and when I submit a job
> for spark, does it run with just with the libraries inside the jar, right?
> Where is the conflict between jacksons libraries?
>


Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
Hi, list.

We are working on a spark application that sends messages to Druid. For
that, we're using Tranquility core. In my local test, I'm using the
"spark-1.6.1-bin-hadoop2.6" distribution and the following dependencies in
my app:


<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>io.druid</groupId>
    <artifactId>tranquility-core_2.10</artifactId>
    <version>0.7.4</version>
</dependency>

But I'm getting the error below when Tranquility tries to create the
Tranquilizer object:

tranquilizer = 
DruidBeams.fromConfig(dataSourceConfig).buildTranquilizer(tranquilizerBuider);

The stacktrace is down below:

java.lang.IllegalAccessError: tried to access method
com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap;
from class
com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
at
com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
at
com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
at
com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
at
com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._constructDefaultValueInstantiator(BasicDeserializerFactory.java:325)
at
com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.findValueInstantiator(BasicDeserializerFactory.java:266)
at
com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:266)
at
com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:168)
at
com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:399)
at
com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:348)
at
com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:261)
at
com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:241)
at
com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142)
at
com.fasterxml.jackson.databind.DeserializationContext.findContextualValueDeserializer(DeserializationContext.java:380)
at
com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.construct(PropertyBasedCreator.java:96)
at
com.fasterxml.jackson.databind.deser.BeanDeserializerBase.resolve(BeanDeserializerBase.java:413)
at
com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:292)
at
com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:241)
at
com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142)
at
com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:394)
at
com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3169)
at
com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:2767)
at
com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:2700)
at
com.metamx.tranquility.druid.DruidBeams$.fromConfigInternal(DruidBeams.scala:192)
at
com.metamx.tranquility.druid.DruidBeams$.fromConfig(DruidBeams.scala:119)
at com.metamx.tranquility.druid.DruidBeams.fromConfig(DruidBeams.scala)

Has someone faced this problem too?

I know that it's related to a Jackson lib conflict, but could anyone please
shed some light? I created a jar with dependencies, so when I submit a job
to Spark it runs with just the libraries inside the jar, right?
Where is the conflict between the Jackson libraries?


Re: confusing about Spark SQL json format

2016-03-31 Thread Ross.Cramblit
You are correct that it does not take the standard JSON file format. From the 
Spark Docs:
"Note that the file that is offered as a json file is not a typical JSON file. 
Each line must contain a separate, self-contained valid JSON object. As a 
consequence, a regular multi-line JSON file will most often fail.”

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

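For genuinely multi-line files, one workaround sketch (assuming each input file
holds a single JSON object, and the sc/sqlContext from the shell) is to read
whole files and hand the flattened strings to the JSON reader:

// Read (path, content) pairs, flatten each file to one line, then parse.
val oneObjectPerFile = sc.wholeTextFiles("/path/to/multiline-json")   // illustrative path
  .map { case (_, content) => content.replace("\n", " ") }
val df = sqlContext.read.json(oneObjectPerFile)
df.printSchema()
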
On Mar 31, 2016, at 5:30 AM, charles li 
> wrote:

hi, UMESH, have you tried to load that json file on your machine? I did try it 
before, and here is the screenshot:

<屏幕快照 2016-03-31 下午5.27.30.png>
<屏幕快照 2016-03-31 下午5.27.39.png>
​
​




On Thu, Mar 31, 2016 at 5:19 PM, UMESH CHAUDHARY 
> wrote:
Hi Charles,
The definition of object from 
www.json.org:

An object is an unordered set of name/value pairs. An object begins with { 
(left brace) and ends with } (right brace). Each name is followed by : (colon) 
and the name/value pairs are separated by , (comma).

Its a pretty much OOPS paradigm , isn't it?

Regards,
Umesh

On Thu, Mar 31, 2016 at 2:34 PM, charles li 
> wrote:
hi, UMESH, I think you've misunderstood the json definition.

there is only one object in a json file:


for the file, people.json, as bellow:



{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}

---

it does have two valid format:

1.



[ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
{"name":"Michael", "address":{"city":null, "state":"California"}}
]

---

2.



{"name": ["Yin", "Michael"],
"address":[ {"city":"Columbus","state":"Ohio"},
{"city":null, "state":"California"} ]
}
---



On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY 
> wrote:
Hi,
Look at below image which is from 
json.org
 :



The above image describes the object formulation of below JSON:

Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
Object=> {"name":"Michael", "address":{"city":null, "state":"California"}}


Note that "address" is also an object.



On Thu, Mar 31, 2016 at 1:53 PM, charles li 
> wrote:
as this post  says, that in spark, we can load a json file in this way bellow:

post : 
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html


---
sqlContext.jsonFile(file_path)
or
sqlContext.read.json(file_path)
---


and the json file format looks like bellow, say people.json

{"name":"Yin",
 "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}
---


and here comes my problems:

Is that the standard json format? according to 
http://www.json.org/
 , I don't think so. it's just a collection of records [ a dict ], not a valid 
json format. as the json 

Re: transformation - spark vs cassandra

2016-03-31 Thread Arun Sethia
Thanks Imre.

But I thought the spark-cassandra connector was going to do the same thing internally.

On Thu, Mar 31, 2016 at 10:32 AM, Imre Nagi  wrote:

> I think querying by cassandra query language will be better in terms of
> performance if you want to pull and filter the data from your db, rather
> than pulling all of the data and do some filtering and transformation by
> using spark data frame.
>
>
> On 31 Mar 2016 22:19, "asethia"  wrote:
>
>> Hi,
>>
>> I am working with Cassandra and Spark, would like to know what is best
>> performance using Cassandra filter based on primary key and cluster key vs
>> using spark data frame transformation/filters.
>>
>> for example in spark:
>>
>>  val rdd = sqlContext.read.format("org.apache.spark.sql.cassandra")
>>   .options(Map("keyspace" -> "test", "table" -> "test"))
>>   .load()
>>
>> and then rdd.filter("cdate ='2016-06-07'").filter("country='USA'").count()
>>
>> vs
>>
>> using Cassandra (where cdate is part of primary key and country as cluster
>> key).
>>
>> SELECT count(*) FROM test WHERE cdate ='2016-06-07' AND country='USA'
>>
>> I would like to know when should we use Cassandra simple query vs
>> dataframe
>> in terms of performance with billion of rows.
>>
>> Thanks
>> arun
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/transformation-spark-vs-cassandra-tp26647.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Concurrent Spark jobs

2016-03-31 Thread emlyn
In case anyone else has the same problem and finds this - in my case it was
fixed by increasing spark.sql.broadcastTimeout (I used 9000).

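For reference, a minimal sketch of where that setting goes (the value is in
seconds, 9000 as above):

// Either set it on the SQLContext before the offending joins run...
sqlContext.setConf("spark.sql.broadcastTimeout", "9000")
// ...or pass it at submit time: --conf spark.sql.broadcastTimeout=9000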


--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Concurrent-Spark-jobs-tp26011p26648.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: transformation - spark vs cassandra

2016-03-31 Thread Imre Nagi
I think querying with the Cassandra query language will be better in terms of
performance if you want to pull and filter the data from your db, rather
than pulling all of the data and doing the filtering and transformation with
a Spark DataFrame.


On 31 Mar 2016 22:19, "asethia"  wrote:

> Hi,
>
> I am working with Cassandra and Spark, would like to know what is best
> performance using Cassandra filter based on primary key and cluster key vs
> using spark data frame transformation/filters.
>
> for example in spark:
>
>  val rdd = sqlContext.read.format("org.apache.spark.sql.cassandra")
>   .options(Map("keyspace" -> "test", "table" -> "test"))
>   .load()
>
> and then rdd.filter("cdate ='2016-06-07'").filter("country='USA'").count()
>
> vs
>
> using Cassandra (where cdate is part of primary key and country as cluster
> key).
>
> SELECT count(*) FROM test WHERE cdate ='2016-06-07' AND country='USA'
>
> I would like to know when should we use Cassandra simple query vs dataframe
> in terms of performance with billion of rows.
>
> Thanks
> arun
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/transformation-spark-vs-cassandra-tp26647.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


transformation - spark vs cassandra

2016-03-31 Thread asethia
Hi,

I am working with Cassandra and Spark, and would like to know which gives the
best performance: filtering in Cassandra based on the primary key and
clustering key, or using Spark DataFrame transformations/filters.

for example in spark:

 val rdd = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "test"))
  .load()

and then rdd.filter("cdate ='2016-06-07'").filter("country='USA'").count()

vs

using Cassandra (where cdate is part of primary key and country as cluster
key).

SELECT count(*) FROM test WHERE cdate ='2016-06-07' AND country='USA'

I would like to know when we should use a simple Cassandra query vs. a
DataFrame, in terms of performance, with billions of rows.

Thanks
arun 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/transformation-spark-vs-cassandra-tp26647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread Yong Zhang
I agree that there won't be a generic solution for these kinds of cases.
Without a CBO in Spark or the Hadoop ecosystem in the near future, maybe Spark
DataFrame/SQL should support more hints from the end user, as in these cases
end users will be smart enough to tell the engine the correct way to execute the query.
Didn't the relational DBs follow exactly the same path? RBO -> RBO + Hints -> CBO?
Yong

Date: Thu, 31 Mar 2016 16:07:14 +0530
Subject: Re: SPARK-13900 - Join with simple OR conditions take too long
From: hemant9...@gmail.com
To: ashokkumar.rajend...@gmail.com
CC: user@spark.apache.org

Hi Ashok,

That's interesting. 

As I understand it, a nested loop join over tables A and B (which produces m x n
rows) is performed, and then each row is evaluated to see whether any of the
conditions is met. You are asking that Spark should instead do a
BroadcastHashJoin on each equality condition in parallel and then union the
results, like you are doing in a separate query.

If we leave parallelism aside for a moment, theoretically the time taken by the
nested loop join would vary little as the number of conditions increases, while
the time taken by the solution that you are suggesting would grow linearly with
the number of conditions. So, when there are too many conditions, the nested
loop join would be faster than the solution that you suggest. Now the question
is: how should Spark decide when to do what?

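For concreteness, a rough sketch of the manual rewrite being discussed, assuming
two DataFrames dfA and dfB with illustrative join columns k1/k2:

// A JOIN B ON (a.k1 = b.k1 OR a.k2 = b.k2) rewritten as a union of equi-joins,
// each of which can use a broadcast/hash join instead of a nested loop join.
val j1 = dfA.join(dfB, dfA("k1") === dfB("k1"))
val j2 = dfA.join(dfB, dfA("k2") === dfB("k2"))
// distinct() drops rows that satisfied both conditions and so appear twice;
// note it also collapses genuinely duplicate rows, so this is only equivalent
// when the joined rows are unique.
val result = j1.unionAll(j2).distinct()
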
Hemant Bhanawat
www.snappydata.io 


On Thu, Mar 31, 2016 at 2:28 PM, ashokkumar rajendran 
 wrote:
Hi,

I have filed ticket SPARK-13900. There was an initial reply from a developer 
but did not get any reply on this. How can we do multiple hash joins together 
for OR conditions based joins? Could someone please guide on how can we fix 
this? 
Regards
Ashok


  

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Ted Yu
I tried this:

scala> final case class Text(id: Int, text: String)
warning: there was one unchecked warning; re-run with -unchecked for details
defined class Text

scala> val ds = Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text]
ds: org.apache.spark.sql.Dataset[Text] = [id: int, text: string]

scala> ds.map(t => t.id).show
+-----+
|value|
+-----+
|    0|
|    1|
+-----+

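On the second question: a Dataset can be built from a local Seq directly via
toDS() once the implicits are in scope, so the toDF.as[Text] round trip isn't
needed (a sketch, assuming the SQLContext/SparkSession implicits are imported):

import sqlContext.implicits._

final case class Text(id: Int, text: String)
val ds = Seq(Text(0, "hello"), Text(1, "world")).toDS()
ds.map(t => t.id).show()
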
On Thu, Mar 31, 2016 at 5:02 AM, Jacek Laskowski  wrote:

> Hi,
>
> I can't seem to use Dataset using case classes (or tuples) to select per
> field:
>
> scala> final case class Text(id: Int, text: String)
> warning: there was one unchecked warning; re-run with -unchecked for
> details
> defined class Text
>
> scala> val ds = Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text]
> ds: org.apache.spark.sql.Dataset[Text] = [id: int, text: string]
>
> // query per field as symbol works fine
> scala> ds.select('id).show
> +---+
> | id|
> +---+
> |  0|
> |  1|
> +---+
>
> // but not per field as Scala attribute
> scala> ds.select(_.id).show
> :40: error: missing parameter type for expanded function
> ((x$1) => x$1.id)
>ds.select(_.id).show
>  ^
>
> Is this supposed to work in Spark 2.0 (today's build)?
>
> BTW, Why is Seq(Text(0, "hello"), Text(1, "world")).as[Text] not possible?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Restart App and consume from checkpoint using direct kafka API

2016-03-31 Thread Cody Koeninger
Long story short, no. Don't rely on checkpoints if you can't handle
reprocessing some of your data.

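For reference, a rough sketch of the "manage the offsets yourself" approach
described below; ssc, kafkaParams and the loadOffsets/saveOffsets helpers are
hypothetical stand-ins for your own streaming context and database code:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Start exactly where the previous run (or the previous jar) left off.
val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (m: MessageAndMetadata[String, String]) => (m.key, m.message))

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  saveOffsets(ranges)   // persist topic, partition and untilOffset back to the database
}
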
On Thu, Mar 31, 2016 at 3:02 AM, Imre Nagi  wrote:
> I don't know how to read the data from the checkpoint. But AFAIK and based
> on my experience, I think the best thing that you can do is to store the
> offset in a particular storage such as a database every time you consume a
> message. Then read the offset from the database every time you want to start
> reading messages.
>
> nb: This approach is also explained by Cody in his blog post.
>
> Thanks
>
> On Thu, Mar 31, 2016 at 2:14 PM, vimal dinakaran 
> wrote:
>>
>> Hi,
>>  In the blog
>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>
>> It is mentioned that enabling checkpoint works as long as the app jar is
>> unchanged.
>>
>> If I want to upgrade the jar with the latest code and consume from kafka
>> where it was stopped , how to do that ?
>> Is there a way to read the binary object of the checkpoint during init and
>> use that to start from offset ?
>>
>> Thanks
>> Vimal
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re:Re: How to design the input source of spark stream

2016-03-31 Thread 李明伟
Hi Anthony


Thanks. You are right, the API will read all the files, so there is no need to merge them.






At 2016-03-31 20:09:25, "Femi Anthony"  wrote:

Also,  ssc.textFileStream(dataDir) will read all the files from a directory so 
as far as I can see there's no need to merge the files. Just write them to the 
same HDFS directory.


On Thu, Mar 31, 2016 at 8:04 AM, Femi Anthony  wrote:

I don't think you need to do it this way.

Take a look here : 
http://spark.apache.org/docs/latest/streaming-programming-guide.html
in this section:

Level of Parallelism in Data Receiving
 Receiving multiple data streams can therefore be achieved by creating multiple 
input DStreams and configuring them to receive different partitions of the data 
stream from the source(s)
These multiple DStreams can be unioned together to create a single DStream. 
Then the transformations that were being applied on a single input DStream can 
be applied on the unified stream.




On Wed, Mar 30, 2016 at 11:08 PM, kramer2...@126.com wrote:
Hi

My environment is described like below:

5 nodes, each nodes generate a big csv file every 5 minutes. I need spark
stream to analyze these 5 files in every five minutes to generate some
report.

I am planning to do it in this way:

1. Put those 5 files into HDSF directory called /data
2. Merge them into one big file in that directory
3. Use spark stream constructor textFileStream('/data') to generate my
inputDStream

The problem of this way is I do not know how to merge the 5 files in HDFS.
It seems very difficult to do it in python.

So question is

1. Can you tell me how to merge files in hdfs by python?
2. Do you know some other way to input those files into spark?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-design-the-input-source-of-spark-stream-tp26641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







--

http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." 
- Albert Einstein.





--

http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." 
- Albert Einstein.

Re: Does Spark CSV accept a CSV String

2016-03-31 Thread Mich Talebzadeh
Well, my guess is to just pkunzip it and then either re-compress it with bzip2
or leave it as it is.

Databricks handles *.bz2 type files. I know that.

Anyway, that is the easy part :)

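For the pkzip files themselves, a rough workaround sketch that avoids
re-compressing (assuming each .zip archive holds a single CSV entry; the path
is illustrative):

import java.util.zip.ZipInputStream
import scala.io.Source

// Read each archive as one binary blob, open its first entry, materialise the lines.
val csvLines = sc.binaryFiles("hdfs:///data/stg/zips/*.zip")
  .flatMap { case (_, pds) =>
    val zis = new ZipInputStream(pds.open())
    zis.getNextEntry                                 // position at the single CSV entry
    val lines = Source.fromInputStream(zis).getLines().toList
    zis.close()
    lines
  }
csvLines.take(5).foreach(println)
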
Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 March 2016 at 01:02, Benjamin Kim  wrote:

> Hi Mich,
>
> I forgot to mention that - this is the ugly part - the source data
> provider gives us (Windows) pkzip compressed files. Will spark uncompress
> these automatically? I haven’t been able to make it work.
>
> Thanks,
> Ben
>
> On Mar 30, 2016, at 2:27 PM, Mich Talebzadeh 
> wrote:
>
> Hi Ben,
>
> Well I have done it for standard csv files downloaded from spreadsheets to
> staging directory on hdfs and loaded from there.
>
> First you may not need to unzip them. Databricks can read them (in my
> case) as zipped files.
>
> Check this. Mine is slightly different from what you have, First I zip my
> csv files with bzip2 and load them into hdfs
>
> #!/bin/ksh
> DIR="/data/stg/accounts/nw/10124772"
> #
> ## Compress the files
> #
> echo `date` " ""===  Started compressing all csv FILEs"
> for FILE in `ls *.csv`
> do
>   /usr/bin/bzip2 ${FILE}
> done
> #
> ## Clear out hdfs staging directory
> #
> echo `date` " ""===  Started deleting old files from hdfs staging
> directory ${DIR}"
> hdfs dfs -rm -r ${DIR}/*.bz2
> echo `date` " ""===  Started Putting bz2 fileS to hdfs staging
> directory ${DIR}"
> for FILE in `ls *.bz2`
> do
>   hdfs dfs -copyFromLocal ${FILE} ${DIR}
> done
> echo `date` " ""===  Checking that all files are moved to hdfs staging
> directory"
> hdfs dfs -ls ${DIR}
> exit 0
>
> Now you have all your csv files in the staging directory
>
> import org.apache.spark.sql.functions._
> import java.sql.{Date, Timestamp}
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> println ("\nStarted at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
> ").collect.foreach(println)
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header", "true").load("
> hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
> case class Accounts( TransactionDate: String, TransactionType: String,
> Description: String, Value: Double, Balance: Double, AccountName: String,
> AccountNumber : String)
> // Map the columns to names
> //
> val a = df.filter(col("Date") > "").map(p =>
> Accounts(p(0).toString,p(1).toString,p(2).toString,p(3).toString.toDouble,p(4).toString.toDouble,p(5).toString,p(6).toString))
> //
> // Create a Spark temporary table
> //
> a.toDF.registerTempTable("tmp")
>
> // Need to create and populate target ORC table nw_10124772 in database
> accounts.in Hive
> //
> sql("use accounts")
> //
> // Drop and create table nw_10124772
> //
> sql("DROP TABLE IF EXISTS accounts.nw_10124772")
> var sqltext : String = ""
> sqltext = """
> CREATE TABLE accounts.nw_10124772 (
> TransactionDate   DATE
> ,TransactionType   String
> ,Description   String
> ,Value Double
> ,Balance   Double
> ,AccountName   String
> ,AccountNumber Int
> )
> COMMENT 'from csv file from excel sheet'
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="ZLIB" )
> """
> sql(sqltext)
> //
> // Put data in Hive table. Clean up is already done
> //
> sqltext = """
> INSERT INTO TABLE accounts.nw_10124772
> SELECT
>
> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(TransactionDate,'dd/MM/yyyy'),'yyyy-MM-dd'))
> AS TransactionDate
> , TransactionType
> , Description
> , Value
> , Balance
> , AccountName
> , AccountNumber
> FROM tmp
> """
> sql(sqltext)
>
> println ("\nFinished at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
>
> Once you store into a some form of table (Parquet, ORC) etc you can do
> whatever you like with it.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 March 2016 at 22:13, Benjamin Kim  wrote:
>
>> Hi Mich,
>>
>> You are correct. I am talking about the Databricks package spark-csv you
>> have below.
>>
>> The files are stored in s3 and I download, unzip, and store each one of
>> them in a variable as a string using the AWS SDK (aws-java-sdk-1.10.60.jar).
>>
>> Here is some of the code.
>>
>> val filesRdd = sc.parallelize(lFiles, 250)
>> filesRdd.foreachPartition(files => {
>>   val s3Client = new AmazonS3Client(new
>> EnvironmentVariableCredentialsProvider())
>>   files.foreach(file => {
>>

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-31 Thread Eugene Morozov
Joseph,

Correction: there are 20k features. Is that still a lot?
What number of features can be considered as normal?

--
Be well!
Jean Morozov

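One thing worth double-checking in the setup quoted below (a hedged aside, not a
confirmed fix): the node-id cache only checkpoints, and therefore only truncates
the growing lineage that is serialized with every task, if a checkpoint directory
has been set on the SparkContext first.

// Without this, useNodeIdCache/checkpointInterval silently skip checkpointing.
sc.setCheckpointDir("hdfs:///tmp/rf-checkpoints")    // illustrative path
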
On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley 
wrote:

> First thought: 70K features is *a lot* for the MLlib implementation (and
> any PLANET-like implementation)
>
> Using fewer partitions is a good idea.
>
> Which Spark version was this on?
>
> On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>> The questions I have in mind:
>>
>> Is it smth that the one might expect? From the stack trace itself it's
>> not clear where does it come from.
>> Is it an already known bug? Although I haven't found anything like that.
>> Is it possible to configure something to workaround / avoid this?
>>
>> I'm not sure it's the right thing to do, but I've
>> increased thread stack size 10 times (to 80MB)
>> reduced default parallelism 10 times (only 20 cores are available)
>>
>> Thank you in advance.
>>
>> --
>> Be well!
>> Jean Morozov
>>
>> On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <
>> evgeny.a.moro...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a web service that provides rest api to train random forest algo.
>>> I train random forest on a 5 nodes spark cluster with enough memory -
>>> everything is cached (~22 GB).
>>> On a small datasets up to 100k samples everything is fine, but with the
>>> biggest one (400k samples and ~70k features) I'm stuck with
>>> StackOverflowError.
>>>
>>> Additional options for my web service
>>> spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
>>> spark.default.parallelism = 200.
>>>
>>> On a 400k samples dataset
>>> - (with default thread stack size) it took 4 hours of training to get
>>> the error.
>>> - with increased stack size it took 60 hours to hit it.
>>> I can increase it, but it's hard to say what amount of memory it needs
>>> and it's applied to all of the treads and might waste a lot of memory.
>>>
>>> I'm looking at different stages at event timeline now and see that task
>>> deserialization time gradually increases. And at the end task
>>> deserialization time is roughly same as executor computing time.
>>>
>>> Code I use to train model:
>>>
>>> int MAX_BINS = 16;
>>> int NUM_CLASSES = 0;
>>> double MIN_INFO_GAIN = 0.0;
>>> int MAX_MEMORY_IN_MB = 256;
>>> double SUBSAMPLING_RATE = 1.0;
>>> boolean USE_NODEID_CACHE = true;
>>> int CHECKPOINT_INTERVAL = 10;
>>> int RANDOM_SEED = 12345;
>>>
>>> int NODE_SIZE = 5;
>>> int maxDepth = 30;
>>> int numTrees = 50;
>>> Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(), 
>>> maxDepth, NUM_CLASSES, MAX_BINS,
>>> QuantileStrategy.Sort(), new 
>>> scala.collection.immutable.HashMap<>(), nodeSize, MIN_INFO_GAIN,
>>> MAX_MEMORY_IN_MB, SUBSAMPLING_RATE, USE_NODEID_CACHE, 
>>> CHECKPOINT_INTERVAL);
>>> RandomForestModel model = RandomForest.trainRegressor(labeledPoints.rdd(), 
>>> strategy, numTrees, "auto", RANDOM_SEED);
>>>
>>>
>>> Any advice would be highly appreciated.
>>>
>>> The exception (~3000 lines long):
>>>  java.lang.StackOverflowError
>>> at
>>> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2320)
>>> at
>>> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2333)
>>> at
>>> java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2828)
>>> at
>>> java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1453)
>>> at
>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1512)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>>> at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>> at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>> at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>> at
>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>> at
>>> scala.collection.immutable.$colon$colon.readObject(List.scala:366)
>>> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>> at
>>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>> at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>> at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>> at
>>> 

Re: Spark for Log Analytics

2016-03-31 Thread ashish rawat
Thanks for your replies Steve and Chris.

Steve,

I am creating a real-time pipeline, so I am not looking to dump data to
HDFS right now. Also, since the log sources would be Nginx, Mongo and
application events, it might not be possible to always route events
directly from the source to Flume. Therefore, I thought that the "tail -f"
strategy used by fluentd, logstash and others might be the only unifying
solution for collecting logs.

Chris,

Can you please elaborate on the source-to-Kafka part? Do all event sources
have integration with Kafka? E.g. if you need to send the server logs
(Apache/Nginx/Mongo etc.) to Kafka, what would be the ideal strategy?

Regards,
Ashish

On Thu, Mar 31, 2016 at 5:16 PM, Chris Fregly  wrote:

> oh, and I forgot to mention Kafka Streams which has been heavily talked
> about the last few days at Strata here in San Jose.
>
> Streams can simplify a lot of this architecture by perform some
> light-to-medium-complex transformations in Kafka itself.
>
> i'm waiting anxiously for Kafka 0.10 with production-ready Kafka Streams,
> so I can try this out myself - and hopefully remove a lot of extra plumbing.
>
> On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly  wrote:
>
>> this is a very common pattern, yes.
>>
>> note that in Netflix's case, they're currently pushing all of their logs
>> to a Fronting Kafka + Samza Router which can route to S3 (or HDFS),
>> ElasticSearch, and/or another Kafka Topic for further consumption by
>> internal apps using other technologies like Spark Streaming (instead of
>> Samza).
>>
>> this Fronting Kafka + Samza Router also helps to differentiate between
>> high-priority events (Errors or High Latencies) and normal-priority events
>> (normal User Play or Stop events).
>>
>> here's a recent presentation i did which details this configuration
>> starting at slide 104:
>> http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations
>> .
>>
>> btw, Confluent's distribution of Kafka does have a direct Http/REST API
>> which is not recommended for production use, but has worked well for me in
>> the past.
>>
>> these are some additional options to think about, anyway.
>>
>>
>> On Thu, Mar 31, 2016 at 4:26 AM, Steve Loughran 
>> wrote:
>>
>>>
>>> On 31 Mar 2016, at 09:37, ashish rawat  wrote:
>>>
>>> Hi,
>>>
>>> I have been evaluating Spark for analysing Application and Server Logs.
>>> I believe there are some downsides of doing this:
>>>
>>> 1. No direct mechanism of collecting log, so need to introduce other
>>> tools like Flume into the pipeline.
>>>
>>>
>>> you need something to collect logs no matter what you run. Flume isn't
>>> so bad; if you bring it up on the same host as the app then you can even
>>> collect logs while the network is playing up.
>>>
>>> Or you can just copy log4j files to HDFS and process them later
>>>
>>> 2. Need to write lots of code for parsing different patterns from logs,
>>> while some of the log analysis tools like logstash or loggly provide it out
>>> of the box
>>>
>>>
>>>
>>> Log parsing is essentially an ETL problem, especially if you don't try
>>> to lock down the log event format.
>>>
>>> You can also configure Log4J to save stuff in an easy-to-parse format
>>> and/or forward directly to your application.
>>>
>>> There's a log4j to flume connector to do that for you,
>>>
>>>
>>> http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html
>>>
>>> or you can output in, say, JSON (
>>> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
>>>  )
>>>
>>> I'd go with flume unless you had a need to save the logs locally and
>>> copy them to HDFS later.
>>>
>>>
>>>
>>> On the benefits side, I believe Spark might be more performant (although
>>> I am yet to benchmark it) and being a generic processing engine, might work
>>> with complex use cases where the out of the box functionality of log
>>> analysis tools is not sufficient (although I don't have any such use case
>>> right now).
>>>
>>> One option I was considering was to use logstash for collection and
>>> basic processing and then sink the processed logs to both elastic search
>>> and kafka. So that Spark Streaming can pick data from Kafka for the complex
>>> use cases, while logstash filters can be used for the simpler use cases.
>>>
>>> I was wondering if someone has already done this evaluation and could
>>> provide me some pointers on how/if to create this pipeline with Spark.
>>>
>>> Regards,
>>> Ashish
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
>>
>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>


Re: Spark streaming spilling all the data to disk even if memory available

2016-03-31 Thread Akhil Das
Use StorageLevel MEMORY_ONLY. Also have a look at the createDirectStream
API. Most likely in your case the processing time exceeds the batch duration,
and the accumulating scheduling delay eventually blows up the memory.
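
A rough sketch of the first suggestion (zookeeper quorum, group and topic map
are illustrative; ssc is your StreamingContext):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream as in the thread, but keeping received blocks in memory only.
val stream = KafkaUtils.createStream(ssc, "zk1:2181", "my-group",
  Map("events" -> 1), StorageLevel.MEMORY_ONLY)
// The receiver-less KafkaUtils.createDirectStream API avoids receiver block storage entirely.
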
On Mar 31, 2016 6:13 PM, "Mayur Mohite"  wrote:

> We are using KafkaUtils.createStream API to read data from kafka topics
> and we are using StorageLevel.MEMORY_AND_DISK_SER option while configuring
> kafka streams.
>
> On Wed, Mar 30, 2016 at 12:58 PM, Akhil Das 
> wrote:
>
>> Can you elaborate more on from where you are streaming the data and what
>> type of consumer you are using etc?
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Mar 29, 2016 at 6:10 PM, Mayur Mohite 
>> wrote:
>>
>>> Hi,
>>>
>>> We are running spark streaming app on a single machine and we have
>>> configured spark executor memory to 30G.
>>> We noticed that after running the app for 12 hours, spark streaming
>>> started spilling ALL the data to disk even though we have configured
>>> sufficient memory for spark to use for storage.
>>>
>>> -Mayur
>>>
>>> Learn more about our inaugural *FirstScreen Conference
>>> *!
>>> *Where the worlds of mobile advertising and technology meet!*
>>>
>>> June 15, 2016 @ Urania Berlin
>>>
>>
>>
>
>
> --
> *Mayur Mohite*
> Senior Software Engineer
>
> Phone: +91 9035867742
> Skype: mayur.mohite_applift
>
>
> *AppLift India*
> 107/3, 80 Feet Main Road,
> Koramangala 4th Block,
> Bangalore - 560034
> www.AppLift.com 
>
>
> Learn more about our inaugural *FirstScreen Conference
> *!
> *Where the worlds of mobile advertising and technology meet!*
>
> June 15, 2016 @ Urania Berlin
>


Re: Spark streaming spilling all the data to disk even if memory available

2016-03-31 Thread Mayur Mohite
We are using the KafkaUtils.createStream API to read data from Kafka topics, and
we are using the StorageLevel.MEMORY_AND_DISK_SER option while configuring the
Kafka streams.

On Wed, Mar 30, 2016 at 12:58 PM, Akhil Das 
wrote:

> Can you elaborate more on from where you are streaming the data and what
> type of consumer you are using etc?
>
> Thanks
> Best Regards
>
> On Tue, Mar 29, 2016 at 6:10 PM, Mayur Mohite 
> wrote:
>
>> Hi,
>>
>> We are running spark streaming app on a single machine and we have
>> configured spark executor memory to 30G.
>> We noticed that after running the app for 12 hours, spark streaming
>> started spilling ALL the data to disk even though we have configured
>> sufficient memory for spark to use for storage.
>>
>> -Mayur
>>
>> Learn more about our inaugural *FirstScreen Conference
>> *!
>> *Where the worlds of mobile advertising and technology meet!*
>>
>> June 15, 2016 @ Urania Berlin
>>
>
>


-- 
*Mayur Mohite*
Senior Software Engineer

Phone: +91 9035867742
Skype: mayur.mohite_applift


*AppLift India*
107/3, 80 Feet Main Road,
Koramangala 4th Block,
Bangalore - 560034
www.AppLift.com 

-- 


Learn more about our inaugural *FirstScreen Conference 
*!
*Where the worlds of mobile advertising and technology meet!*

June 15, 2016 @ Urania Berlin


Re: How to design the input source of spark stream

2016-03-31 Thread Femi Anthony
Also, ssc.textFileStream(dataDir) will pick up all the new files written to a
directory, so as far as I can see there's no need to merge the files. Just
write them all to the same HDFS directory.

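A minimal sketch of that setup (paths and batch interval are illustrative; note
textFileStream only picks up files that appear in the directory after the
stream starts):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(300))    // 5-minute batches
val lines = ssc.textFileStream("hdfs:///data")      // all 5 nodes write here
lines.foreachRDD { rdd =>
  // build the 5-minute report from this batch's files
  println(s"rows in this batch: ${rdd.count()}")
}
ssc.start()
ssc.awaitTermination()
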
On Thu, Mar 31, 2016 at 8:04 AM, Femi Anthony  wrote:

> I don't think you need to do it this way.
>
> Take a look here :
> http://spark.apache.org/docs/latest/streaming-programming-guide.html
> in this section:
> Level of Parallelism in Data Receiving
>  Receiving multiple data streams can therefore be achieved by creating
> multiple input DStreams and configuring them to receive different
> partitions of the data stream from the source(s)
> These multiple DStreams can be unioned together to create a single
> DStream. Then the transformations that were being applied on a single input
> DStream can be applied on the unified stream.
>
>
> On Wed, Mar 30, 2016 at 11:08 PM, kramer2...@126.com 
> wrote:
>
>> Hi
>>
>> My environment is described like below:
>>
>> 5 nodes, each nodes generate a big csv file every 5 minutes. I need spark
>> stream to analyze these 5 files in every five minutes to generate some
>> report.
>>
>> I am planning to do it in this way:
>>
>> 1. Put those 5 files into HDSF directory called /data
>> 2. Merge them into one big file in that directory
>> 3. Use spark stream constructor textFileStream('/data') to generate my
>> inputDStream
>>
>> The problem of this way is I do not know how to merge the 5 files in HDFS.
>> It seems very difficult to do it in python.
>>
>> So question is
>>
>> 1. Can you tell me how to merge files in hdfs by python?
>> 2. Do you know some other way to input those files into spark?
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-design-the-input-source-of-spark-stream-tp26641.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>



-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: How to design the input source of spark stream

2016-03-31 Thread Femi Anthony
I don't think you need to do it this way.

Take a look here :
http://spark.apache.org/docs/latest/streaming-programming-guide.html
in this section:
Level of Parallelism in Data Receiving
 Receiving multiple data streams can therefore be achieved by creating
multiple input DStreams and configuring them to receive different
partitions of the data stream from the source(s)
These multiple DStreams can be unioned together to create a single DStream.
Then the transformations that were being applied on a single input DStream
can be applied on the unified stream.


On Wed, Mar 30, 2016 at 11:08 PM, kramer2...@126.com 
wrote:

> Hi
>
> My environment is described like below:
>
> 5 nodes, each nodes generate a big csv file every 5 minutes. I need spark
> stream to analyze these 5 files in every five minutes to generate some
> report.
>
> I am planning to do it in this way:
>
> 1. Put those 5 files into HDSF directory called /data
> 2. Merge them into one big file in that directory
> 3. Use spark stream constructor textFileStream('/data') to generate my
> inputDStream
>
> The problem of this way is I do not know how to merge the 5 files in HDFS.
> It seems very difficult to do it in python.
>
> So question is
>
> 1. Can you tell me how to merge files in hdfs by python?
> 2. Do you know some other way to input those files into spark?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-design-the-input-source-of-spark-stream-tp26641.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Jacek Laskowski
Hi,

I can't seem to use Dataset using case classes (or tuples) to select per field:

scala> final case class Text(id: Int, text: String)
warning: there was one unchecked warning; re-run with -unchecked for details
defined class Text

scala> val ds = Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text]
ds: org.apache.spark.sql.Dataset[Text] = [id: int, text: string]

// query per field as symbol works fine
scala> ds.select('id).show
+---+
| id|
+---+
|  0|
|  1|
+---+

// but not per field as Scala attribute
scala> ds.select(_.id).show
:40: error: missing parameter type for expanded function
((x$1) => x$1.id)
   ds.select(_.id).show
 ^

Is this supposed to work in Spark 2.0 (today's build)?

BTW, Why is Seq(Text(0, "hello"), Text(1, "world")).as[Text] not possible?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Execution error during ALS execution in spark

2016-03-31 Thread pankajrawat
Hi, 

While building Recommendation engine using spark MLlib (ALS) we are facing
some issues during execution. 

Details are below. 

We are trying to train our model on 1.4 million sparse rating records (100,000
customers x 50,000 items). The execution DAG cycle takes a long time and
crashes after several hours when executing the
model.recommendProductsForUsers() step. The causes of the exceptions are
non-uniform and vary from run to run.
The common exceptions faced during the last 10 runs are:
a)  Akka Timeout
b)  Out of Memory Exceptions
c)  Executor disassociation.

We have tried increasing the timeouts to 1200 seconds, but that doesn't seem
to have any impact:
   sparkConf.set("spark.network.timeout", "1200s");
   sparkConf.set("spark.rpc.askTimeout", "1200s");
   sparkConf.set("spark.rpc.lookupTimeout", "1200s");
   sparkConf.set("spark.akka.timeout", "1200s"); 

Our command line parameters are as follows --num-executors 5
--executor-memory 2G --conf spark.yarn.executor.memoryOverhead=600 --conf
spark.default.parallelism=500 --master yarn

Configuration
1.  3 node cluster,  16 GB RAM, Intel I7 processor. 
2.  Spark 1.5.2

The algorithm works perfectly for smaller numbers of
records.

We would appreciate any help in this regard and would like to know the following:
1.  How can we handle execution over this many records in Spark without failures,
given that the rating records will grow over time?
2.  Are we missing any command line parameters that are necessary for this
type of heavy execution?
3.  Is the above cluster size and configuration adequate for processing this many
records? A long execution time is fine, but the process should not fail.
4.  What exactly does the Akka timeout error mean during ALS job execution?

Regards,
Pankaj Rawat




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Execution-error-during-ALS-execution-in-spark-tp26644.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
oh, and I forgot to mention Kafka Streams which has been heavily talked
about the last few days at Strata here in San Jose.

Streams can simplify a lot of this architecture by performing some
light-to-medium-complex transformations in Kafka itself.

i'm waiting anxiously for Kafka 0.10 with production-ready Kafka Streams,
so I can try this out myself - and hopefully remove a lot of extra plumbing.

On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly  wrote:

> this is a very common pattern, yes.
>
> note that in Netflix's case, they're currently pushing all of their logs
> to a Fronting Kafka + Samza Router which can route to S3 (or HDFS),
> ElasticSearch, and/or another Kafka Topic for further consumption by
> internal apps using other technologies like Spark Streaming (instead of
> Samza).
>
> this Fronting Kafka + Samza Router also helps to differentiate between
> high-priority events (Errors or High Latencies) and normal-priority events
> (normal User Play or Stop events).
>
> here's a recent presentation i did which details this configuration
> starting at slide 104:
> http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations
> .
>
> btw, Confluent's distribution of Kafka does have a direct Http/REST API
> which is not recommended for production use, but has worked well for me in
> the past.
>
> these are some additional options to think about, anyway.
>
>
> On Thu, Mar 31, 2016 at 4:26 AM, Steve Loughran 
> wrote:
>
>>
>> On 31 Mar 2016, at 09:37, ashish rawat  wrote:
>>
>> Hi,
>>
>> I have been evaluating Spark for analysing Application and Server Logs. I
>> believe there are some downsides of doing this:
>>
>> 1. No direct mechanism of collecting log, so need to introduce other
>> tools like Flume into the pipeline.
>>
>>
>> you need something to collect logs no matter what you run. Flume isn't so
>> bad; if you bring it up on the same host as the app then you can even
>> collect logs while the network is playing up.
>>
>> Or you can just copy log4j files to HDFS and process them later
>>
>> 2. Need to write lots of code for parsing different patterns from logs,
>> while some of the log analysis tools like logstash or loggly provide it out
>> of the box
>>
>>
>>
>> Log parsing is essentially an ETL problem, especially if you don't try to
>> lock down the log event format.
>>
>> You can also configure Log4J to save stuff in an easy-to-parse format
>> and/or forward directly to your application.
>>
>> There's a log4j to flume connector to do that for you,
>>
>>
>> http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html
>>
>> or you can output in, say, JSON (
>> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
>>  )
>>
>> I'd go with flume unless you had a need to save the logs locally and copy
>> them to HDFS laster.
>>
>>
>>
>> On the benefits side, I believe Spark might be more performant (although
>> I am yet to benchmark it) and being a generic processing engine, might work
>> with complex use cases where the out of the box functionality of log
>> analysis tools is not sufficient (although I don't have any such use case
>> right now).
>>
>> One option I was considering was to use logstash for collection and basic
>> processing and then sink the processed logs to both elastic search and
>> kafka. So that Spark Streaming can pick data from Kafka for the complex use
>> cases, while logstash filters can be used for the simpler use cases.
>>
>> I was wondering if someone has already done this evaluation and could
>> provide me some pointers on how/if to create this pipeline with Spark.
>>
>> Regards,
>> Ashish
>>
>>
>>
>>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>



-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
this is a very common pattern, yes.

note that in Netflix's case, they're currently pushing all of their logs to
a Fronting Kafka + Samza Router which can route to S3 (or HDFS),
ElasticSearch, and/or another Kafka Topic for further consumption by
internal apps using other technologies like Spark Streaming (instead of
Samza).

this Fronting Kafka + Samza Router also helps to differentiate between
high-priority events (Errors or High Latencies) and normal-priority events
(normal User Play or Stop events).

here's a recent presentation i did which details this configuration
starting at slide 104:
http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations
.

btw, Confluent's distribution of Kafka does have a direct Http/REST API
which is not recommended for production use, but has worked well for me in
the past.

these are some additional options to think about, anyway.


On Thu, Mar 31, 2016 at 4:26 AM, Steve Loughran 
wrote:

>
> On 31 Mar 2016, at 09:37, ashish rawat  wrote:
>
> Hi,
>
> I have been evaluating Spark for analysing Application and Server Logs. I
> believe there are some downsides of doing this:
>
> 1. No direct mechanism of collecting log, so need to introduce other tools
> like Flume into the pipeline.
>
>
> you need something to collect logs no matter what you run. Flume isn't so
> bad; if you bring it up on the same host as the app then you can even
> collect logs while the network is playing up.
>
> Or you can just copy log4j files to HDFS and process them later
>
> 2. Need to write lots of code for parsing different patterns from logs,
> while some of the log analysis tools like logstash or loggly provide it out
> of the box
>
>
>
> Log parsing is essentially an ETL problem, especially if you don't try to
> lock down the log event format.
>
> You can also configure Log4J to save stuff in an easy-to-parse format
> and/or forward directly to your application.
>
> There's a log4j to flume connector to do that for you,
>
>
> http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html
>
> or you can output in, say, JSON (
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
>  )
>
> I'd go with flume unless you had a need to save the logs locally and copy
> them to HDFS laster.
>
>
>
> On the benefits side, I believe Spark might be more performant (although I
> am yet to benchmark it) and being a generic processing engine, might work
> with complex use cases where the out of the box functionality of log
> analysis tools is not sufficient (although I don't have any such use case
> right now).
>
> One option I was considering was to use logstash for collection and basic
> processing and then sink the processed logs to both elastic search and
> kafka. So that Spark Streaming can pick data from Kafka for the complex use
> cases, while logstash filters can be used for the simpler use cases.
>
> I was wondering if someone has already done this evaluation and could
> provide me some pointers on how/if to create this pipeline with Spark.
>
> Regards,
> Ashish
>
>
>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Execution error during ALS execution in spark

2016-03-31 Thread Pankaj Rawat
Hi,

While building a recommendation engine using Spark MLlib (ALS), we are facing 
some issues during execution.

Details are below.

We are trying to train our model on 1.4 million sparse rating records (100,000 
customers X 50,000 items). The execution DAG cycle is taking a long time and 
is crashing after several hours when executing the 
model.recommendProductsForUsers() step. The causes of the exceptions are 
non-uniform and vary from time to time.

The common exceptions faced during the last 10 runs are:

a)  Akka Timeout

b)  Out of Memory Exceptions

c)   Executor disassociation.

We have tried increasing the timeouts to 1200 seconds, but that doesn't seem to 
have an impact:
   sparkConf.set("spark.network.timeout", "1200s");
   sparkConf.set("spark.rpc.askTimeout", "1200s");
   sparkConf.set("spark.rpc.lookupTimeout", "1200s");
   sparkConf.set("spark.akka.timeout", "1200s");

Our command line parameters are as follows --num-executors 5 
--executor-memory 2G --conf spark.yarn.executor.memoryOverhead=600 --conf 
spark.default.parallelism=500 --master yarn

Configuration

1.   3 node cluster,  16 GB RAM, Intel I7 processor.

2.   Spark 1.5.2

The algorithm works perfectly for a smaller number of records.

We would appreciate any help in this regard and would like to know the following:

1.   How can we handle the execution of large numbers of records in Spark without 
failures, as the rating records will increase over time?

2.   Are we missing any command-line parameters that are necessary for this 
type of heavy execution?

3.   Is the above cluster size and configuration adequate for processing this many 
records?  A large amount of time taken during execution is fine, but 
the process should not fail.

4.   What exactly is meant by the Akka timeout error during ALS job execution?

Regards,
Pankaj Rawat


Re: Spark for Log Analytics

2016-03-31 Thread Steve Loughran

On 31 Mar 2016, at 09:37, ashish rawat 
> wrote:

Hi,

I have been evaluating Spark for analysing Application and Server Logs. I 
believe there are some downsides of doing this:

1. No direct mechanism of collecting log, so need to introduce other tools like 
Flume into the pipeline.

you need something to collect logs no matter what you run. Flume isn't so bad; 
if you bring it up on the same host as the app then you can even collect logs 
while the network is playing up.

Or you can just copy log4j files to HDFS and process them later

2. Need to write lots of code for parsing different patterns from logs, while 
some of the log analysis tools like logstash or loggly provide it out of the box



Log parsing is essentially an ETL problem, especially if you don't try to lock 
down the log event format.
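
As a hedged illustration of that ETL view, here is a minimal parser for one common
layout; the regex and the field positions are assumptions about an Apache
common/combined log line, not anything taken from this thread:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// host ident user [timestamp] "METHOD path protocol" status bytes
Pattern ACCESS_LINE = Pattern.compile(
    "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) \\S+\" (\\d{3}) (\\S+)");

// returns {host, timestamp, method, path, status, bytes}, or null if the line doesn't match
String[] parseAccessLine(String line) {
  Matcher m = ACCESS_LINE.matcher(line);
  if (!m.find()) {
    return null;
  }
  return new String[] {m.group(1), m.group(2), m.group(3),
                       m.group(4), m.group(5), m.group(6)};
}

Whatever collects the lines (Flume, plain files on HDFS), the parsing stays ordinary ETL
code: something like sc.textFile(...).map(...) calling a function of this shape, with the
nulls filtered out.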

You can also configure Log4J to save stuff in an easy-to-parse format and/or 
forward directly to your application.

There's a log4j to flume connector to do that for you,

 http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html

or you can output in, say, JSON 
(https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
 )

I'd go with Flume unless you have a need to save the logs locally and copy them 
to HDFS later.



On the benefits side, I believe Spark might be more performant (although I am 
yet to benchmark it) and being a generic processing engine, might work with 
complex use cases where the out of the box functionality of log analysis tools 
is not sufficient (although I don't have any such use case right now).

One option I was considering was to use logstash for collection and basic 
processing and then sink the processed logs to both elastic search and kafka. 
So that Spark Streaming can pick data from Kafka for the complex use cases, 
while logstash filters can be used for the simpler use cases.

I was wondering if someone has already done this evaluation and could provide 
me some pointers on how/if to create this pipeline with Spark.

Regards,
Ashish





Re: Read Parquet in Java Spark

2016-03-31 Thread Ramkumar V
Hi,

Thanks for the reply. I tried this. It's returning JavaRDD<Row> instead of
the JavaRDD element type I need. How do I get that?

Error:
incompatible types:
org.apache.spark.api.java.JavaRDD<Row> cannot be
converted to the required org.apache.spark.api.java.JavaRDD type
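
A hedged sketch of one way to bridge that, assuming plain strings are the target element
type (the actual type you need may differ; the file path is illustrative):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

DataFrame parquetFile = sqlContext.read().parquet("people.parquet");

// javaRDD() always yields JavaRDD<Row>; map each Row into whatever element type you need
JavaRDD<String> lines = parquetFile.javaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.mkString(",");   // or pull individual columns via row.getString(i), etc.
    }
});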





*Thanks*,



On Thu, Mar 31, 2016 at 2:57 PM, UMESH CHAUDHARY 
wrote:

> From Spark Documentation:
>
> DataFrame parquetFile = sqlContext.read().parquet("people.parquet");
>
> JavaRDD jRDD= parquetFile.javaRDD()
>
> javaRDD() method will convert the DF to RDD
>
> On Thu, Mar 31, 2016 at 2:51 PM, Ramkumar V 
> wrote:
>
>> Hi,
>>
>> I'm trying to read parquet log files in Java Spark. Parquet log files are
>> stored in hdfs. I want to read and convert that parquet file into JavaRDD.
>> I could able to find Sqlcontext dataframe api. How can I read if it
>> is sparkcontext and rdd ? what is the best way to read it ?
>>
>> *Thanks*,
>> 
>>
>>
>


Re: SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread Hemant Bhanawat
Hi Ashok,

That's interesting.

As I understand it, on tables A and B a nested loop join (that will produce m
X n rows) is performed, and then each row is evaluated to see if any of the
conditions is met. You are asking that Spark should instead do a
BroadcastHashJoin on the equality conditions in parallel and then union the
results, like you are doing in a different query.

If we leave aside parallelism for a moment, theoretically, the time taken for
a nested loop join would vary little as the number of conditions increases,
while the time taken for the solution that you are suggesting would increase
linearly with the number of conditions. So, when the number of conditions is
very large, the nested loop join would be faster than the solution that you
suggest. Now the question is, how should Spark decide when to do what?
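
To make that concrete, a hedged sketch of the manual rewrite being discussed, assuming
sqlContext is in scope; the table and column names are illustrative:

DataFrame a = sqlContext.table("A");
DataFrame b = sqlContext.table("B");

// Instead of one join on (A.x = B.x OR A.y = B.y), which ends up as a nested loop join,
// run two equi-joins (each eligible for a broadcast/hash join) and union them.
DataFrame joinOnX = a.join(b, a.col("x").equalTo(b.col("x")));
DataFrame joinOnY = a.join(b, a.col("y").equalTo(b.col("y")));

// distinct() drops the duplicates produced by pairs that satisfy both conditions
DataFrame unioned = joinOnX.unionAll(joinOnY).distinct();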


Hemant Bhanawat 
www.snappydata.io

On Thu, Mar 31, 2016 at 2:28 PM, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:

> Hi,
>
> I have filed ticket SPARK-13900. There was an initial reply from a
> developer but did not get any reply on this. How can we do multiple hash
> joins together for OR conditions based joins? Could someone please guide on
> how can we fix this?
>
> Regards
> Ashok
>


Re: Unable to Run Spark Streaming Job in Hadoop YARN mode

2016-03-31 Thread Ted Yu
Looking through
https://spark.apache.org/docs/latest/configuration.html#spark-streaming , I
don't see config specific to YARN.

Can you pastebin the exception you saw ?

When the job stopped, was there any error ?

Thanks

On Wed, Mar 30, 2016 at 10:57 PM, Soni spark 
wrote:

> Hi All,
>
> I am unable to run Spark Streaming job in my Hadoop Cluster, its behaving
> unexpectedly. When i submit a job, it fails by throwing some socket
> exception in HDFS, if i run the same job second or third time, it runs for
> sometime and stops.
>
> I am confused. Is there any configuration in YARN-Site.xml file specific
> to spark ???
>
> Please suggest me.
>
>
>


Re: Read Parquet in Java Spark

2016-03-31 Thread UMESH CHAUDHARY
From the Spark documentation:

DataFrame parquetFile = sqlContext.read().parquet("people.parquet");

JavaRDD<Row> jRDD = parquetFile.javaRDD();

The javaRDD() method will convert the DataFrame to a JavaRDD<Row>.

On Thu, Mar 31, 2016 at 2:51 PM, Ramkumar V  wrote:

> Hi,
>
> I'm trying to read parquet log files in Java Spark. Parquet log files are
> stored in hdfs. I want to read and convert that parquet file into JavaRDD.
> I could able to find Sqlcontext dataframe api. How can I read if it
> is sparkcontext and rdd ? what is the best way to read it ?
>
> *Thanks*,
> 
>
>


Read Parquet in Java Spark

2016-03-31 Thread Ramkumar V
Hi,

I'm trying to read Parquet log files in Java Spark. The Parquet log files are
stored in HDFS. I want to read and convert those Parquet files into a JavaRDD.
I could only find the SQLContext DataFrame API. How can I read them using a
SparkContext and RDDs? What is the best way to read them?

*Thanks*,



Re: confusing about Spark SQL json format

2016-03-31 Thread UMESH CHAUDHARY
Hi Charles,
The definition of an object from www.json.org:

An *object* is an unordered set of name/value pairs. An object begins with {
 (left brace) and ends with } (right brace). Each name is followed by :
(colon) and the name/value pairs are separated by , (comma).

It's pretty much an OOP paradigm, isn't it?

Regards,
Umesh

On Thu, Mar 31, 2016 at 2:34 PM, charles li  wrote:

> hi, UMESH, I think you've misunderstood the json definition.
>
> there is only one object in a json file:
>
>
> for the file, people.json, as bellow:
>
>
> 
>
> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
> {"name":"Michael", "address":{"city":null, "state":"California"}}
>
>
> ---
>
> it does have two valid format:
>
> 1.
>
>
> 
>
> [ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
> {"name":"Michael", "address":{"city":null, "state":"California"}}
> ]
>
>
> ---
>
> 2.
>
>
> 
>
> {"name": ["Yin", "Michael"],
> "address":[ {"city":"Columbus","state":"Ohio"},
> {"city":null, "state":"California"} ]
> }
>
> ---
>
>
>
> On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY 
> wrote:
>
>> Hi,
>> Look at below image which is from json.org :
>>
>> [image: Inline image 1]
>>
>> The above image describes the object formulation of below JSON:
>>
>> Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
>> Object=> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>
>>
>> Note that "address" is also an object.
>>
>>
>>
>> On Thu, Mar 31, 2016 at 1:53 PM, charles li 
>> wrote:
>>
>>> as this post  says, that in spark, we can load a json file in this way
>>> bellow:
>>>
>>> *post* :
>>> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>>>
>>>
>>>
>>> ---
>>> sqlContext.jsonFile(file_path)
>>> or
>>> sqlContext.read.json(file_path)
>>>
>>> ---
>>>
>>>
>>> and the *json file format* looks like bellow, say *people.json*
>>>
>>>
>>> {"name":"Yin",
>>> "address":{"city":"Columbus","state":"Ohio"}}
>>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>>
>>> ---
>>>
>>>
>>> and here comes my *problems*:
>>>
>>> Is that the *standard json format*? according to http://www.json.org/ ,
>>> I don't think so. it's just a *collection of records* [ a dict ], not a
>>> valid json format. as the json official doc, the standard json format of
>>> people.json should be :
>>>
>>>
>>> {"name":
>>> ["Yin", "Michael"],
>>> "address":[ {"city":"Columbus","state":"Ohio"},
>>> {"city":null, "state":"California"} ]
>>> }
>>>
>>> ---
>>>
>>> So, why we define the json format as a collection of records in spark, I
>>> mean, it will lead to some unconvenient, for if we had a large standard
>>> json file, we need to firstly format it to make it correctly readable in
>>> spark, which will low-efficiency, time-consuming, un-compatible and
>>> space-consuming.
>>>
>>>
>>> great thanks,
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> *--*
>>> a spark lover, a quant, a developer and a good man.
>>>
>>> http://github.com/litaotao
>>>
>>
>>
>
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


Re: confusing about Spark SQL json format

2016-03-31 Thread charles li
Hi UMESH, I think you've misunderstood the JSON definition.

There is only one top-level object in a JSON file.


For the file people.json below:



{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}

---

It has two valid formats:

1.



[ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
{"name":"Michael", "address":{"city":null, "state":"California"}}
]

---

2.



{"name": ["Yin", "Michael"],
"address":[ {"city":"Columbus","state":"Ohio"},
{"city":null, "state":"California"} ]
}
---



On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY 
wrote:

> Hi,
> Look at below image which is from json.org :
>
> [image: Inline image 1]
>
> The above image describes the object formulation of below JSON:
>
> Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
> Object=> {"name":"Michael", "address":{"city":null, "state":"California"}}
>
>
> Note that "address" is also an object.
>
>
>
> On Thu, Mar 31, 2016 at 1:53 PM, charles li 
> wrote:
>
>> as this post  says, that in spark, we can load a json file in this way
>> bellow:
>>
>> *post* :
>> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>>
>>
>>
>> ---
>> sqlContext.jsonFile(file_path)
>> or
>> sqlContext.read.json(file_path)
>>
>> ---
>>
>>
>> and the *json file format* looks like bellow, say *people.json*
>>
>>
>> {"name":"Yin",
>> "address":{"city":"Columbus","state":"Ohio"}}
>> {"name":"Michael", "address":{"city":null, "state":"California"}}
>>
>> ---
>>
>>
>> and here comes my *problems*:
>>
>> Is that the *standard json format*? according to http://www.json.org/ ,
>> I don't think so. it's just a *collection of records* [ a dict ], not a
>> valid json format. as the json official doc, the standard json format of
>> people.json should be :
>>
>>
>> {"name":
>> ["Yin", "Michael"],
>> "address":[ {"city":"Columbus","state":"Ohio"},
>> {"city":null, "state":"California"} ]
>> }
>>
>> ---
>>
>> So, why we define the json format as a collection of records in spark, I
>> mean, it will lead to some unconvenient, for if we had a large standard
>> json file, we need to firstly format it to make it correctly readable in
>> spark, which will low-efficiency, time-consuming, un-compatible and
>> space-consuming.
>>
>>
>> great thanks,
>>
>>
>>
>>
>>
>>
>> --
>> *--*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>>
>
>


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread ashokkumar rajendran
Hi,

I have filed ticket SPARK-13900. There was an initial reply from a
developer but did not get any reply on this. How can we do multiple hash
joins together for OR conditions based joins? Could someone please guide on
how can we fix this?

Regards
Ashok


Re: confusing about Spark SQL json format

2016-03-31 Thread Hechem El Jed
Hello,

Actually, I have been through the same problem as you when I was
implementing a decision tree algorithm with Spark and parsing the output into a
comprehensible JSON format.

So as you said; the correct json format is :
[{
"name": "Yin",
"address": {
"city": "Columbus",
"state": "Ohio"
}
}, {
"name": "Michael",
"address": {
"city": null,
"state": "California"
}
}]

However, I had to treat it as a list and index it, e.g. data[0], to get:

{
"name": "Yin",
"address": {
"city": "Columbus",
"state": "Ohio"
}
}

and then use it for my visualizations.
Spark is still a bit tricky when dealing with input/output formats, so I guess
the solution for now is to create your own parser.
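
A hedged sketch of that "own parser" idea for Java Spark, assuming sc and sqlContext are
in scope, the input is a single standard JSON array in one file, and Jackson does the
parsing; paths are illustrative:

import java.util.ArrayList;
import java.util.List;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.DataFrame;

// read the whole file at once, since a single JSON array cannot be split line by line
JavaRDD<String> wholeFiles = sc.wholeTextFiles("hdfs:///data/people.json").values();

// explode the array into one JSON object string per record
JavaRDD<String> oneObjectPerRecord = wholeFiles.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String content) throws Exception {
        JsonNode array = new ObjectMapper().readTree(content);
        List<String> records = new ArrayList<String>();
        for (JsonNode node : array) {
            records.add(node.toString());
        }
        return records;
    }
});

// now the usual JSON reader works, because each element is a single object
DataFrame people = sqlContext.read().json(oneObjectPerRecord);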


Cheers,

*Hechem El Jed*
Software Engineer & Business Analyst
MY +601131094294
TN +216 24 937 021
[image: View my profile on LinkedIn]


Our environment is fragile, please do not print this email unless necessary.

On Thu, Mar 31, 2016 at 4:23 PM, charles li  wrote:

> as this post  says, that in spark, we can load a json file in this way
> bellow:
>
> *post* :
> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>
>
>
> ---
> sqlContext.jsonFile(file_path)
> or
> sqlContext.read.json(file_path)
>
> ---
>
>
> and the *json file format* looks like bellow, say *people.json*
>
>
> {"name":"Yin",
> "address":{"city":"Columbus","state":"Ohio"}}
> {"name":"Michael", "address":{"city":null, "state":"California"}}
>
> ---
>
>
> and here comes my *problems*:
>
> Is that the *standard json format*? according to http://www.json.org/ , I
> don't think so. it's just a *collection of records* [ a dict ], not a
> valid json format. as the json official doc, the standard json format of
> people.json should be :
>
>
> {"name":
> ["Yin", "Michael"],
> "address":[ {"city":"Columbus","state":"Ohio"},
> {"city":null, "state":"California"} ]
> }
>
> ---
>
> So, why we define the json format as a collection of records in spark, I
> mean, it will lead to some unconvenient, for if we had a large standard
> json file, we need to firstly format it to make it correctly readable in
> spark, which will low-efficiency, time-consuming, un-compatible and
> space-consuming.
>
>
> great thanks,
>
>
>
>
>
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


Re: confusing about Spark SQL json format

2016-03-31 Thread UMESH CHAUDHARY
Hi,
Look at below image which is from json.org :

[image: Inline image 1]

The above image describes the object formulation of below JSON:

Object 1=> {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
Object=> {"name":"Michael", "address":{"city":null, "state":"California"}}


Note that "address" is also an object.



On Thu, Mar 31, 2016 at 1:53 PM, charles li  wrote:

> as this post  says, that in spark, we can load a json file in this way
> bellow:
>
> *post* :
> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
>
>
>
> ---
> sqlContext.jsonFile(file_path)
> or
> sqlContext.read.json(file_path)
>
> ---
>
>
> and the *json file format* looks like bellow, say *people.json*
>
>
> {"name":"Yin",
> "address":{"city":"Columbus","state":"Ohio"}}
> {"name":"Michael", "address":{"city":null, "state":"California"}}
>
> ---
>
>
> and here comes my *problems*:
>
> Is that the *standard json format*? according to http://www.json.org/ , I
> don't think so. it's just a *collection of records* [ a dict ], not a
> valid json format. as the json official doc, the standard json format of
> people.json should be :
>
>
> {"name":
> ["Yin", "Michael"],
> "address":[ {"city":"Columbus","state":"Ohio"},
> {"city":null, "state":"California"} ]
> }
>
> ---
>
> So, why we define the json format as a collection of records in spark, I
> mean, it will lead to some unconvenient, for if we had a large standard
> json file, we need to firstly format it to make it correctly readable in
> spark, which will low-efficiency, time-consuming, un-compatible and
> space-consuming.
>
>
> great thanks,
>
>
>
>
>
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


Spark for Log Analytics

2016-03-31 Thread ashish rawat
Hi,

I have been evaluating Spark for analysing Application and Server Logs. I
believe there are some downsides of doing this:

1. No direct mechanism for collecting logs, so we need to introduce other tools
like Flume into the pipeline.
2. Need to write lots of code for parsing different patterns from logs, while
some of the log analysis tools like logstash or loggly provide this out of the
box.

On the benefits side, I believe Spark might be more performant (although I
am yet to benchmark it) and being a generic processing engine, might work
with complex use cases where the out of the box functionality of log
analysis tools is not sufficient (although I don't have any such use case
right now).

One option I was considering was to use logstash for collection and basic
processing and then sink the processed logs to both Elasticsearch and
Kafka, so that Spark Streaming can pick data from Kafka for the complex use
cases, while logstash filters can be used for the simpler use cases.

I was wondering if someone has already done this evaluation and could
provide me some pointers on how/if to create this pipeline with Spark.

Regards,
Ashish


confusing about Spark SQL json format

2016-03-31 Thread charles li
As this post says, in Spark we can load a JSON file in the way shown
below:

*post* :
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html


---
sqlContext.jsonFile(file_path)
or
sqlContext.read.json(file_path)
---


and the *json file format* looks like the following, say *people.json*:

{"name":"Yin",
"address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}
---


and here comes my *problem*:

Is that the *standard json format*? According to http://www.json.org/ , I
don't think so. It's just a *collection of records* [ a dict ], not a valid
JSON format. According to the official JSON doc, the standard JSON format of
people.json should be:

{"name":
["Yin", "Michael"],
"address":[ {"city":"Columbus","state":"Ohio"},
{"city":null, "state":"California"} ]
}
---

So, why do we define the JSON format as a collection of records in Spark? I
mean, it leads to some inconvenience: if we had a large standard JSON file,
we would first need to reformat it to make it correctly readable in Spark,
which is inefficient, time-consuming, incompatible and space-consuming.


great thanks,






-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: Restart App and consume from checkpoint using direct kafka API

2016-03-31 Thread Imre Nagi
I don't know how to read the data from the checkpoint. But AFAIK, and based
on my experience, the best thing you can do is store the offsets in a
particular storage system, such as a database, every time you consume a
message. Then read the offsets from the database every time you want to start
reading messages again.

NB: this approach is also explained by Cody in his blog post.
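
A hedged sketch of that pattern with the direct API, assuming jssc is your
JavaStreamingContext; loadOffsetsFromDb and saveOffsetsToDb are hypothetical helpers
around whatever storage you pick, and the broker address is illustrative:

import java.util.HashMap;
import java.util.Map;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "broker1:9092");

// offsets persisted by the previous run of the (possibly upgraded) jar
Map<TopicAndPartition, Long> fromOffsets = loadOffsetsFromDb();      // hypothetical helper

JavaInputDStream<String> stream = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, String.class,
    kafkaParams, fromOffsets,
    new Function<MessageAndMetadata<String, String>, String>() {
        public String call(MessageAndMetadata<String, String> mmd) {
            return mmd.message();
        }
    });

stream.foreachRDD(new Function<JavaRDD<String>, Void>() {
    public Void call(JavaRDD<String> rdd) {
        OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        // process rdd here, then persist the ranges, ideally atomically with the results
        saveOffsetsToDb(ranges);                                     // hypothetical helper
        return null;
    }
});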

Thanks

On Thu, Mar 31, 2016 at 2:14 PM, vimal dinakaran 
wrote:

> Hi,
>  In the blog
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>
> It is mentioned that enabling checkpoint works as long as the app jar is
> unchanged.
>
> If I want to upgrade the jar with the latest code and consume from kafka
> where it was stopped , how to do that ?
> Is there a way to read the binary object of the checkpoint during init and
> use that to start from offset ?
>
> Thanks
> Vimal
>


Restart App and consume from checkpoint using direct kafka API

2016-03-31 Thread vimal dinakaran
Hi,
 In the blog
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md

It is mentioned that enabling checkpointing works as long as the app jar is
unchanged.

If I want to upgrade the jar with the latest code and consume from Kafka
where it was stopped, how do I do that?
Is there a way to read the binary object of the checkpoint during init and
use that to start from the stored offsets?

Thanks
Vimal


Re: No active SparkContext

2016-03-31 Thread Max Schmidt
Just to mark this question as closed: we experienced an OOM exception on
the master, which we didn't see on the driver, but which made it crash.

Am 24.03.2016 um 09:54 schrieb Max Schmidt:
> Hi there,
>
> we're using with the java-api (1.6.0) a ScheduledExecutor that
> continuously executes a SparkJob to a standalone cluster.
>
> After each job we close the JavaSparkContext and create a new one.
>
> But sometimes the Scheduling JVM crashes with:
>
> 24.03.2016-08:30:27:375# error - Application has been killed. Reason:
> All masters are unresponsive! Giving up.
> 24.03.2016-08:30:27:398# error - Error initializing SparkContext.
> java.lang.IllegalStateException: Cannot call methods on a stopped
> SparkContext.
> This stopped SparkContext was created at:
>
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
> io.datapath.spark.AbstractSparkJob.createJavaSparkContext(AbstractSparkJob.java:53)
> io.datapath.measurement.SparkJobMeasurements.work(SparkJobMeasurements.java:130)
> io.datapath.measurement.SparkMeasurementScheduler.lambda$submitSparkJobMeasurement$30(SparkMeasurementScheduler.java:117)
> io.datapath.measurement.SparkMeasurementScheduler$$Lambda$17/1568787282.run(Unknown
> Source)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
>
> The currently active SparkContext was created at:
>
> (No active SparkContext.)
>
> at
> org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:106)
> at
> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1578)
> at
> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2179)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:579)
> at
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
> at
> io.datapath.spark.AbstractSparkJob.createJavaSparkContext(AbstractSparkJob.java:53)
> at
> io.datapath.measurement.SparkJobMeasurements.work(SparkJobMeasurements.java:130)
> at
> io.datapath.measurement.SparkMeasurementScheduler.lambda$submitSparkJobMeasurement$30(SparkMeasurementScheduler.java:117)
> at
> io.datapath.measurement.SparkMeasurementScheduler$$Lambda$17/1568787282.run(Unknown
> Source)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 24.03.2016-08:30:27:402# info - SparkMeasurement - finished.
>
> Any guess?
> -- 
> *Max Schmidt, Senior Java Developer* | m...@datapath.io | LinkedIn
> 
> Datapath.io
>  
> Decreasing AWS latency.
> Your traffic optimized.
>
> Datapath.io GmbH
> Mainz | HRB Nr. 46222
> Sebastian Spies, CEO
>

-- 
*Max Schmidt, Senior Java Developer* | m...@datapath.io
 | LinkedIn

Datapath.io
 
Decreasing AWS latency.
Your traffic optimized.

Datapath.io GmbH
Mainz | HRB Nr. 46222
Sebastian Spies, CEO



Re: Exposing dataframe via thrift server

2016-03-31 Thread Marco Colombo
Is the context that registered the temp table still active?
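
For context, a hedged sketch of the difference in play (paths and table names are
illustrative): a temp table lives only inside the SQLContext/HiveContext that registered
it, while saveAsTable writes through to the Hive metastore.

DataFrame df = sqlContext.read().json("events.json");

df.registerTempTable("check_temp");   // visible only from the context that registered it
df.write().saveAsTable("check");      // persisted in the Hive metastore, visible from beeline

So the temp table should only show up in beeline if the thrift server shares that same
context (for example, one started via HiveThriftServer2.startWithContext).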

On Thursday, 31 March 2016, ram kumar wrote:

> Hi,
>
> I started thrift server
> cd $SPARK_HOME
> ./sbin/start-thriftserver.sh
>
> Then, jdbc client
> $ ./bin/beeline
> Beeline version 1.5.2 by Apache Hive
> beeline>!connect jdbc:hive2://ip:1
> show tables;
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> | check  | false|
> | test   | false|
> ++--+--+
> 5 rows selected (0.126 seconds)
> >
>
> It shows table that are persisted on hive metastore using saveAsTable.
> Temp table (registerTempTable) can't able to view
>
> Can any1 help me with this,
> Thanks
>


-- 
Ing. Marco Colombo